
Big Data Demystified: How to use big data, data science and AI to make better business decisions and gain competitive advantage




Contents

About the author
Acknowledgements
Introduction

Part 1: Big data demystified
1 The story of big data
2 Artificial intelligence, machine learning and big data
3 Why is big data useful?
4 Use cases for (big) data analytics
5 Understanding the big data ecosystem

Part 2: Making the big data ecosystem work for your organization
6 How big data can help guide your strategy
7 Forming your strategy for big data and data science
8 Implementing data science – analytics, algorithms and machine learning
9 Choosing your technologies
10 Building your team
11 Governance and legal compliance
12 Launching the ship – successful deployment in the organization

References
Glossary
Index

About the author

David Stephenson consults and speaks internationally in the fields of data science and big data analytics. He completed his PhD at Cornell University and was a professor at the University of Pennsylvania, designing and teaching courses for students in the engineering and Wharton business schools. David has nearly 20 years of industry experience across Europe and the United States, delivering analytic insights and tools that have guided $10+ billion in business decisions and serving as an expert advisor to top-tier investment, private equity and management consulting firms. He has led global analytics programmes for companies spanning six continents. David is from the USA but has been living in Amsterdam since 2006. More information and content are available on his company website at www.dsianalytics.com.

Acknowledgements

I would like to thank Eloise Cook at Pearson for her valuable editorial guidance, Uri Weis for comments on the first draft of the text and Matt Gardner for comments on the initial chapters. Thanks also to my father for proofing and providing helpful comments on the entire text. Despite their best efforts, any remaining errors are my own.

Introduction

You often hear the term 'big data', but do you really know what it is and why it's important? Can it make a difference in your organization, improving results and bringing competitive advantage? And is it possible that not utilizing big data puts you at a significant competitive disadvantage? The goal of this book is to demystify the term 'big data' and to give you practical ways to leverage this data using data science and machine learning.

The term 'big data' refers to a new class of data: vast, rapidly accumulating quantities which often do not fit a traditional structure. The word 'big' is an understatement that simply does not do justice to the complexity of the situation. The data we are dealing with is not only bigger than traditional data; it is fundamentally different, just as a motorcycle is more than simply a bigger bicycle and an ocean is more than simply a deeper swimming pool. It brings new challenges, presents new opportunities, blurs traditional competitive boundaries and requires a paradigm shift in how we draw tangible value from data. This ocean of data, combined with the technologies developed to handle it, provides insights at enormous scale and has made possible a new wave of machine learning, enabling computers to drive cars, predict heart attacks better than physicians and master extremely complex games such as Go better than any human.

Why is big data a game-changer? As we will see, it allows us to draw much deeper insights from our data, understanding what motivates our customers and what slows down our production lines.
In real time, it enables businesses to simultaneously deliver highly personalized experiences to millions of global customers, and it provides the computational power needed for scientific endeavours to analyse billions of data points in fields such as cancer research, astronomy and particle physics. Big data provides both the data and the computational resources that have enabled the recent resurgence in artificial intelligence, particularly the advances in deep learning, a methodology that has recently been making global headlines.

Beyond the data itself, researchers and engineers have worked over the past two decades to develop an entire ecosystem of hardware and software solutions for collecting, storing, processing and analysing this abundant data. I refer to these hardware and software tools together as the big data ecosystem. This ecosystem allows us to draw immense value from big data for applications in business, science and healthcare. But to use this data, you need to piece together the parts of the big data ecosystem that work best for your applications, and you need to apply appropriate analytic methods to the data – a practice that has come to be known as data science.

All in all, the story of big data is much more than simply a story about data and technology. It is about what is already being done in commerce, science and society, and what difference it can make for your business. Your decisions must go further than purchasing a technology. In this book, I will outline tools, applications and processes and explain how to draw value from modern data in its many forms.

Most organizations see big data as an integral part of their digital transformation. Many of the most successful organizations are already well on their way in applying big data and data science techniques, including machine learning. Research has shown a strong correlation between big data usage and revenue growth (50 per cent higher revenue growth1), and it is not unusual for organizations applying data science techniques to see a 10–20 per cent improvement in key performance indicators (KPIs). For organizations that have not yet started down the path of leveraging big data and data science, the number one barrier is simply not knowing whether the benefits are worth the cost and effort. I hope to make those benefits clear in this book, providing case studies along the way to illustrate the value and the risks involved.

In the second half of this book, I'll describe practical steps for creating a data strategy and for getting data projects done within your organization. I'll talk about how to bring the right people together and create a plan for collecting and using data. I'll discuss specific areas in which data science and big data tools can be used within your organization to improve results, and I'll give advice on finding and hiring the right people to carry out these plans. I'll also talk about additional considerations you'll need to address, such as data governance and privacy protection, with a view to protecting your organization against competitive, reputational and legal risks. We'll end with additional practical advice for successfully carrying out data initiatives within your organization.
Overview of chapters

Part 1: Big data demystified

Chapter 1: The story of big data
How big data developed into a phenomenon, why big data has become such an important topic over the past few years, where the data is coming from, who is using it and why, and what has changed to make possible today what was not possible in the past.

Chapter 2: Artificial intelligence, machine learning and big data
A brief history of artificial intelligence (AI), how it relates to machine learning, an introduction to neural networks and deep learning, how AI is used today and how it relates to big data, and some words of caution in working with AI.

Chapter 3: Why is big data useful?
How our data paradigm is changing, how big data opens new opportunities and improves established analytic techniques, and what it means to be data-driven, including success stories and case studies.

Chapter 4: Use cases for (big) data analytics
An overview of 20 common business applications of (big) data, analytics and data science, with an emphasis on ways in which big data improves existing analytic methods.

Chapter 5: Understanding the big data ecosystem
Overview of key concepts related to big data, such as open-source code, distributed computing and cloud computing.

Part 2: Making the big data ecosystem work for your organization

Chapter 6: How big data can help guide your strategy
Using big data to guide strategy based on insights into your customers, your product performance, your competitors and additional external factors.

Chapter 7: Forming your strategy for big data and data science
Step-by-step instructions for scoping your data initiatives based on business goals and broad stakeholder input, assembling a project team, determining the most relevant analytics projects and carrying projects through to completion.

Chapter 8: Implementing data science – analytics, algorithms and machine learning
Overview of the primary types of analytics, how to select models and databases, and the importance of agile methods to realize business value.

Chapter 9: Choosing your technologies
Choosing technologies for your big data solution: which decisions you'll need to make, what to keep in mind, and what resources are available to help make these choices.

Chapter 10: Building your team
The key roles needed in big data and data science programmes, and considerations for hiring or outsourcing those roles.

Chapter 11: Governance and legal compliance
Principles in privacy, data protection, regulatory compliance and data governance, and their impact from legal, reputational and internal perspectives. Discussions of PII, linkage attacks and Europe's new privacy regulation (GDPR). Case studies of companies that have gotten into trouble from inappropriate use of data.

Chapter 12: Launching the ship – successful deployment in the organization
Case study of a high-profile project failure. Best practices for making data initiatives successful in your organization, including advice on making your organization more data-driven, positioning your analytics staff within your organization, consolidating data and using resources efficiently.

Part 1: Big data demystified

Chapter 1: The story of big data

We've always struggled with storing data. Not long ago, our holidays were remembered at a cost of $1 per photo. We saved only the very best TV shows and music recitals, overwriting older recordings. Our computers always ran out of memory. Newer, cheaper technologies turned up the tap on that data flow.
We bought digital cameras, and we linked our computers to networks. We saved more data on less expensive computers, but we still sorted and discarded information continuously. We were frugal with storage, but the data we stored was small enough to manage.

Data started flowing thicker and faster. Technology made it progressively easier for anyone to create data. Roll film cameras gave way to digital video cameras, even on our smartphones. We recorded videos we never replayed. High-resolution sensors spread through scientific and industrial equipment. More documents were saved in digital format. More significantly, the internet began linking global data silos, creating challenges and opportunities we were ill-equipped to handle. The coup de grâce came with the development of crowdsourced digital publishing, such as YouTube and Facebook, which opened the portal for anyone with a connected digital device to make nearly unlimited contributions to the world's data stores.

But storage was only part of the challenge. While we were rationing our storage, computer scientists were rationing computer processing power. They were writing computer programs to solve problems in science and industry: helping to understand chemical reactions, predict stock market movements and minimize the cost of complicated resource scheduling problems. Their programs could take days or weeks to finish, and only the most well-endowed organizations could purchase the powerful computers needed to solve the harder problems. In the 1960s and again in the 1980s, computer scientists were building high hopes for advancements in the field of machine learning (ML), a type of artificial intelligence (AI), but their efforts stalled each time, largely due to limitations in data and technology.

In summary, our ability to draw value from data was severely limited by the technologies of the twentieth century.

What changed towards the start of the twenty-first century?

There were several key developments towards the start of the twenty-first century. One of the most significant originated in Google. Created to navigate the overwhelming data on the newly minted world wide web, Google was all about big data. Its researchers soon developed ways to make normal computers work together like supercomputers, and in 2003 they published these results in a paper which formed the basis for a software framework known as Hadoop. Hadoop became the bedrock on which much of the world's initial big data efforts would be built.

The concept of 'big data' incubated quietly in the technology sector for nearly a decade before becoming mainstream. The breakthrough into management circles seemed to happen around 2011, when McKinsey published their report, 'Big data: The next frontier for innovation, competition, and productivity'.2 The first public talk I gave on big data was at a designated 'big data' conference in London the next year (2012), produced by a media company seizing the opportunity to leverage a newly trending topic. But even before the McKinsey paper, large data-driven companies such as eBay were already developing internal solutions for fundamental big data challenges. By the time of McKinsey's 2011 publication, Hadoop was already five years old and the University of California at Berkeley had open-sourced its Spark framework, the Hadoop successor that leveraged inexpensive RAM to process big data much more quickly than Hadoop.
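To make the idea of many ordinary machines working on one dataset concrete, here is a minimal sketch of the classic distributed word count, written against Spark's Python API (PySpark). The file path is a placeholder and a local Spark installation is assumed; the point is that the same few lines run unchanged whether the data sits on a laptop or is spread across a cluster of inexpensive computers.

```python
# Minimal PySpark word count: the classic illustration of distributed processing.
# Assumes Spark is installed; "weblogs.txt" is a placeholder path (local, HDFS or S3).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()

lines = spark.sparkContext.textFile("weblogs.txt")
counts = (
    lines.flatMap(lambda line: line.split())      # split each line into words
         .map(lambda word: (word, 1))             # emit (word, 1) pairs (the "map" step)
         .reduceByKey(lambda a, b: a + b)         # sum the counts per word (the "reduce" step)
)

for word, n in counts.take(10):                   # inspect a small sample of the results
    print(word, n)

spark.stop()
```

The framework, not the analyst, decides how the file is split across machines and how the partial counts are brought back together; that division of labour is what Hadoop introduced and Spark later accelerated.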
Let's look at why data has grown so rapidly over the past few years and why the topic of 'big data' has become so prominent.

Why so much data?

The volume of data we are committing to digital memory is undergoing explosive growth for two reasons:

1. The proliferation of devices that generate digital data: ubiquitous personal computers and mobile phones, scientific sensors, and the literally billions of sensors across the expanding Internet of Things (IoT) (see Figure 1.1).
2. The rapidly plummeting cost of digital storage.

The proliferation of devices that generate digital data

Technology that creates and collects data has become cheap, and it is everywhere. These computers, smartphones, cameras, RFID (radio-frequency identification) tags, movement sensors and so on have found their way into the hands of the mass consumer market as well as those of scientists, industries and governments. Sometimes we intentionally create data, such as when we take videos or post to websites, and sometimes we create data unintentionally, leaving a digital footprint on a webpage that we browse, or carrying smartphones that send geospatial information to network providers. Sometimes the data doesn't relate to us at all, but is a record of machine activity or scientific phenomena. Let's look at some of the main sources and uses of the data modern technology is generating.

Figure 1.1 Number of IoT devices by category.3

Content generation and self-publishing

What does it take to get your writing published? A few years ago, it took a printing press and a network of booksellers. With the internet, you only needed the skills to create a web page. Today, anyone with a Facebook or Twitter account can instantly publish content with worldwide reach. A similar story has played out for films and videos. Modern technology, particularly the internet, has completely changed the nature of publishing and has facilitated a massive growth in human-generated content.

Self-publishing platforms for the masses, particularly Facebook, YouTube and Twitter, threw open the floodgates of mass-produced data. Anyone could easily post content online, and the proliferation of mobile devices, particularly those capable of recording and uploading video, further lowered the barriers. Since nearly everyone now has a personal device with a high-resolution video camera and continuous internet access, the data uploads are enormous. Even children easily upload limitless text or video to the public domain.

YouTube, one of the most successful self-publishing platforms, is possibly the single largest consumer of corporate data storage today. Based on previously published statistics, it is estimated that YouTube is adding approximately 100 petabytes (PB) of new data per year, generated from several hundred hours of video uploaded each minute. We are also watching a tremendous amount of video online, on YouTube, Netflix and similar streaming services. Cisco recently estimated that it would take more than 5 million years to watch the amount of video that will cross global IP (internet protocol) networks each month in 2020.
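That YouTube estimate is easy to sanity-check with a back-of-envelope calculation. The inputs below are illustrative assumptions (an upload rate of roughly 300 hours per minute and roughly 0.7 GB stored per hour of video), not published YouTube figures, but they land in the same ballpark as the estimate quoted above.

```python
# Back-of-envelope check of the ~100 PB/year estimate quoted above.
# Both inputs are illustrative assumptions, not official YouTube figures.
hours_uploaded_per_minute = 300      # "several hundred hours" of video per minute
gb_stored_per_video_hour = 0.7       # rough average across resolutions and codecs

minutes_per_year = 60 * 24 * 365
petabytes_per_year = (
    hours_uploaded_per_minute * minutes_per_year * gb_stored_per_video_hour / 1e6
)
print(f"~{petabytes_per_year:.0f} PB of new video per year")   # prints roughly 110 PB
```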
Consumer activity

When I visit a website, the owner of that site can see what information I request from the site (search words, filters selected, links clicked). The site can also use the JavaScript running in my browser to record how I interact with the page: when I scroll down or hover my mouse over an item. Websites use these details to better understand visitors, and a site might record details for several hundred categories of online actions (searches, clicks, scrolls, hovers, etc.). Even if I never log in and the site doesn't know who I am, the insights are valuable. The more information the site gathers about its visitor base, the better it can optimize marketing efforts, landing pages and product mix.

Mobile devices produce even heavier digital trails. An application installed on my smartphone may have access to the device sensors, including GPS (global positioning system). Since many people always keep their smartphones near them, the phones maintain very accurate data logs of the location and activity cycles of their owners. Since the phones are typically in constant communication with cell towers and Wi-Fi routers, third parties may also see the owners' locations. Even companies with brick-and-mortar shops are increasingly using signals from smartphones to track the physical movement of customers within their stores.

Many companies put considerable effort into analysing these digital trails, particularly e-commerce companies wanting to better understand online visitors. In the past, these companies would discard most data, storing only the key events (e.g. completed sales), but many websites are now storing all data from each online visit, allowing them to look back and ask detailed questions. The scale of this customer journey data is typically several gigabytes (GB) per day for smaller websites and several terabytes (TB) per day for larger sites. We'll return to the benefits of analysing customer journey data in later chapters.

We are generating data even when we are offline, through our phone conversations or when moving past video cameras in shops, city streets, airports or roadways. Security companies and intelligence agencies rely heavily on such data. In fact, the largest consumer of data storage today is quite likely the United States' National Security Agency (NSA). In August 2014, the NSA completed construction of a massive data centre in Bluffdale, Utah, codenamed Bumblehive, at a cost somewhere between 1 and 2 billion dollars. Its actual storage capacity is classified, but the governor of Utah told reporters in 2012 that it would be 'the first facility in the world expected to gather and house a yottabyte'.

Machine data and the Internet of Things (IoT)

Machines never tire of generating data, and the number of connected machines is growing at a rapid pace. One of the more mind-blowing things you can do in the next five minutes is to check out Cisco's Visual Networking Index™, which recently estimated that global IP traffic will reach over two zettabytes per year by 2020. We may hit a limit in the number of mobile phones and personal computers we use, but we'll continue adding networked processors to the devices around us. This huge network of connected sensors and processors is known as the Internet of Things (IoT). It includes the smart energy meters appearing in our homes, the sensors in our cars that help us drive and sometimes communicate with our insurance companies, the sensors deployed to monitor soil, water, fauna or atmospheric conditions, the digital control systems used to monitor and optimize factory equipment, and more. The number of such devices stood at approximately 5 billion in 2015 and has been estimated to reach between 20 and 50 billion by 2020.

Scientific research

Scientists have been pushing the boundaries of data transport and data processing technologies.
I'll start with an example from particle physics.

Case study – The Large Hadron Collider (particle physics)

One of the most important recent events in physics was witnessed on 4 July 2012: the discovery of the Higgs boson particle, also known as 'the god particle'. After 40 years of searching, researchers finally identified the particle using the Large Hadron Collider (LHC), the world's largest machine4 (see Figure 1.2). The massive LHC lies within a tunnel 17 miles (27 km) in circumference, stretching over the Swiss–French border. Its 150 million sensors deliver data from experiments 30 million times per second. This data is further filtered to a few hundred points of interest per second. The total annual data flow reaches 50 PB, roughly the equivalent of 500 years of full HD-quality movies. It is the poster child of big data research in physics.

Figure 1.2 The world's largest machine.5

Case study – The Square Kilometre Array (astronomy)

On the other side of the world lies the Australian Square Kilometre Array Pathfinder (ASKAP), a radio telescope array of 36 parabolic antennas, each 12 metres in diameter6 and together spanning 4,000 square metres. Twelve of the 36 antennas were activated in October 2016,7 and the full 36, when commissioned, are expected to produce data at a rate of over 7.5 TB per second8 (one month's worth of HD movies per second). Scientists are planning a larger Square Kilometre Array (SKA), which will be spread over several continents and be 100 times larger than the ASKAP. This may be the largest single data collection device ever conceived.

All of this new data presents abundant opportunities, but let's return now to our fundamental problem: the cost of processing and storing that data.

The plummeting cost of disk storage

There are two main types of computer storage: disk (e.g. hard drive) and random access memory (RAM). Disk storage is like a filing cabinet next to your desk. There may be a lot of space, but it takes time to store and retrieve the information. RAM is like the space on top of your desk. There is less space, but you can grab what's there very quickly. Both types of storage are important for handling big data.

Disk storage has been cheaper, so we put most data there. The cost of disk storage has been the limiting factor for data archiving. With a gigabyte (GB) of hard drive storage costing $200,000 in 1980, it's not hard to understand why we stored so little. By 1990, the cost had dropped to $9,000 per GB, still expensive but falling fast. By the year 2000, it had fallen to an amazing $10 per GB. This was a tipping point, as we'll see. By 2017, a GB of hard drive storage cost less than 3 cents (see Figure 1.3).

This drop in storage cost brought interesting consequences. It became cheaper to store useless data than to take the time to filter and discard it (think about all the duplicate photos you've never deleted). We exchanged the challenge of managing scarcity for the challenge of managing over-abundant data, a fundamentally different problem. This story repeats itself across business, science and nearly every sector that relies on digital data for decisions or operations.

Figure 1.3 Historic cost of disk storage per GB (log scale).9

Online companies had previously kept a fraction of web data and discarded the rest. Now these companies are keeping all data: every search, scroll and click, stored with time stamps to allow future reconstruction of each customer visit, just in case the data might prove useful later.
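To get a feel for how steep that price curve is, here is a short calculation based on the per-gigabyte prices quoted above ($200,000 in 1980, roughly 3 cents in 2017); the compounding arithmetic is the only point being illustrated.

```python
# How fast did disk storage get cheaper? Based on the per-GB prices quoted above.
cost_1980, cost_2017, years = 200_000.0, 0.03, 2017 - 1980

annual_factor = (cost_1980 / cost_2017) ** (1 / years)        # ~1.53x cheaper each year
print(f"Average yearly price drop: {1 - 1 / annual_factor:.0%}")  # roughly 35% per year
```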
But exceptionally large hard drives were still exceptionally expensive, and many companies needed them. They could not simply buy additional smaller, inexpensive drives, as the data needed to be processed in a holistic manner (you can divide a load of bricks between several cars, but you need a truck to move a piano). For organizations to take full advantage of the drop in hard drive prices, they would need to find a way to make a small army of mid-sized hard drives operate together as if they were one very large hard drive.

Google's researchers saw the challenge and the opportunity and set about developing the solution that would eventually become Hadoop. It was a way to link many inexpensive computers and make them function like a supercomputer. Their initial solution leveraged disk storage, but soon the attention turned to RAM, the faster but more expensive storage medium.

The plummeting cost of RAM

Disk (hard drive) storage is great for archiving data, but it is slow, requiring time for computer processors to read and write the data as they process it. If you picture working at a very small desk next to an enormous filing cabinet, constantly retrieving and refiling papers to complete your work on that small desk, you'll quickly realize the benefits of a larger desk. RAM storage is like that desk space. It's much faster to work with, which is a significant benefit when processing the huge volumes of high-velocity data that the world was producing. But RAM is much more expensive than disk storage. Its price was also falling, but it had more distance to cover.

Figure 1.4 Historic cost of RAM per GB (log scale). Source: http://www.statisticbrain.com/average-historic-price-of-ram/

How much more expensive is RAM storage? In 1980, when a GB of hard drive cost $200,000, a GB of RAM cost $6 million. By the year 2000, when hard drives were at $15 and could be used for scalable big data solutions, a GB of RAM was well above $1,000, prohibitively expensive for large-scale applications (see Figure 1.4). By 2010, however, RAM had fallen to $12 per GB, the price at which disk storage had seen its tipping point back in 2000. It was time for Berkeley labs to release a new RAM-based big data framework. This computational framework, which they called Spark, used large amounts of RAM to process big data up to 100 times faster than Hadoop's MapReduce processing model.

The plummeting cost of processing power

The cost of computer processing has also plummeted, bringing new opportunities to solve really hard problems and to draw value from the massive amounts of new data that we have started collecting (see Figure 1.5).

Figure 1.5 Historic cost of processing power (log scale).10

Why did big data become such a hot topic?

Over the last 15 years, we've come to realize that big data is an opportunity rather than a problem. McKinsey's 2011 report spoke directly to CEOs, elaborating on the value of big data for five applications (healthcare, retail, manufacturing, the public sector and personal location data). The report predicted big data could raise KPIs by 60 per cent and estimated hundreds of billions of dollars of added value per sector. The term 'big data' became the buzzword heard around the world, drawn out of the corners of technology and cast into the executive spotlight. With so many people talking so much about a topic they so little understood, many quickly grew jaded about the subject.
But big data had become such a foundational concept that Gartner, which had added big data to its Hype Cycle for Emerging Technologies in 2012, made the unusual decision to remove it completely from the Hype Cycle in 2015, acknowledging that big data had become so foundational as to warrant henceforth being referred to simply as 'data' (see Figure 1.6).

Figure 1.6 Gartner Hype Cycle for Emerging Technologies, 2014.

Organizations are now heavily dependent on big data. But why such widespread adoption? Early adopters, such as Google and Yahoo, risked significant investments in hardware and software development. These companies paved the way for others, demonstrating commercial success and sharing computer code. The second wave of adopters did much of the hardest work. They could benefit from the examples of the early adopters and leverage some shared code, but they still needed to make significant investments in hardware and develop substantial internal expertise. Today, we have reached a point where we have the role models and the tools for nearly any organization to start leveraging big data. Let's start by looking at some role models who have inspired us on the journey.

Successful big data pioneers

Google's first mission statement was 'to organize the world's information and make it universally accessible and useful.' Its valuation of $23 billion only eight years later demonstrated to the world the value of mastering big data. It was Google that released the 2003 paper that formed the basis of Hadoop.

In January 2006, Yahoo made the decision to implement Hadoop in its systems.11 Yahoo was also doing quite well in those days, with a stock price that had slowly tripled over the previous five years.

Around the time that Yahoo was implementing Hadoop, eBay was working to rethink how it handled the volume and variety of its customer journey data. Since 2002, eBay had been utilizing a massively parallel processing (MPP) Teradata database for reporting and analytics. The system worked very well, but storing the entire web logs was prohibitively expensive on such a proprietary system. eBay's infrastructure team worked to develop a solution combining several technologies and capable of storing and analysing tens of petabytes of data. This gave eBay significantly more detailed customer insights and played an important role in its platform development, translating directly into revenue gains.

Open-source software has levelled the playing field for software developers

Computers had become cheaper, but they still needed to be programmed to operate in unison if they were to handle big data (such as coordinating several small cars to move a piano, instead of one truck). Code needed to be written for basic functionality, and additional code needed to be written for more specialized tasks. This was a substantial barrier to any big data project, and it is where open-source software played such an important role. Open-source software is software which is made freely available for anyone to use and modify (subject to some restrictions). Because big data software such as Hadoop was open-sourced, developers everywhere could share expertise and build off each other's code.

Hadoop is one of many big data tools that have been open-sourced. As of 2017, there are roughly 100 projects related to big data or Hadoop in the Apache Software Foundation alone (we'll discuss the Apache foundation later). Each of these projects solves a new challenge or solves an old challenge in a new way.
For example, Apache Hive allows companies to use Hadoop as a large database, and Apache Kafka provides messaging between machines. New projects are continually being released to Apache, each one addressing a specific need and further lowering the barrier for subsequent entrants into the big data ecosystem.

Keep in mind

Most of the technology you'll need for extracting value from big data is already readily available. If you're just starting out with big data, leverage as much existing technology as possible.

Affordable hardware and open-sourced software were lowering the barrier for companies to start using big data. But the problem remained that buying and setting up computers for a big data system was an expensive, complicated and risky process, and companies were uncertain how much hardware to purchase. What they needed was access to computing resources without long-term commitment.

Cloud computing has made it easy to launch and scale initiatives

Cloud computing is essentially renting all or part of an offsite computer. Many companies are already using one or more public cloud services: AWS, Azure, Google Cloud or a local provider. Some companies maintain private clouds, which are computing resources maintained centrally within the company and made available to business units on demand. Such private clouds allow efficient use of shared resources.

Cloud computing can provide hardware or software solutions. Salesforce began in 1999 as a Software as a Service (SaaS) company, a form of cloud computing. Amazon Web Services (AWS) launched its Infrastructure as a Service (IaaS) in 2006, first renting storage and a few months later renting entire servers. Microsoft launched its cloud computing platform, Azure, in 2010, and Google launched Google Cloud in 2011.

Cloud computing solved a pain point for companies uncertain of their computing and storage needs. It allowed companies to undertake big data initiatives without the need for large capital expenditures, and it allowed them to immediately scale existing initiatives up or down. In addition, companies could move the cost of big data infrastructure from CapEx to OpEx. The costs of cloud computing are falling, and faster networks allow remote machines to integrate seamlessly. Overall, cloud computing has brought agility to big data, making it possible for companies to experiment and scale without the cost, commitment and wait-time of purchasing dedicated computers.

With scalable data storage and compute power in place, the stage was set for researchers to once again revisit a technology that had stalled in the 1960s and again in the 1980s: artificial intelligence.

Takeaways

Modern technology has given us tools to produce much more digital information than ever before.
The dramatic fall in the cost of digital storage allows us to keep virtually unlimited amounts of data.
Technology pioneers have developed and shared software that enables us to create substantial business value from today's data.

Ask yourself

How are organizations in your sector already using big data technologies? Consider your competitors as well as companies in other sectors.
What data would be useful to you if you could store and analyse it as you'd like? Think, for example, of traffic to your website(s), audio and video recordings, or sensor readings.
What is the biggest barrier to your use of big data: technology, skill sets or use cases?
Chapter 2: Artificial intelligence, machine learning and big data

On 11 May 1997, an IBM computer named Deep Blue made history by defeating Garry Kasparov, the reigning world chess champion, in a match in New York City. Deep Blue won using raw computing muscle, evaluating up to 200 million moves per second as it referred to a list of rules it had been programmed to follow. Its programmers even adjusted its programming between games. But Deep Blue was a one-trick pony, soon dismantled. Computers were still far from outperforming humans at most elementary tasks or in more complicated games, such as the Chinese game of Go, where there are more possible game states than atoms in the universe (see Figure 2.1).

Figure 2.1 A Go gameboard.

Fast forward 19 years to a match in Seoul, Korea, when a program named AlphaGo defeated reigning world Go champion Lee Sedol. Artificial intelligence had not simply improved in the 19 years since Deep Blue; it had become fundamentally different. Whereas Deep Blue had improved through additional, explicit instructions and faster processors, AlphaGo was learning on its own. It first studied expert moves and then it practised against itself. Even the developers of AlphaGo couldn't explain the logic behind certain moves that it made. It had taught itself to make them.

What are artificial intelligence and machine learning?

Artificial intelligence (AI) is a broad term for when a machine can respond intelligently to its environment. We interact with AI in Apple's Siri, Amazon's Echo, self-driving cars, online chat-bots and gaming opponents. AI also helps in less obvious ways. It is filtering spam from our inboxes, correcting our spelling mistakes and deciding which posts appear at the top of our social media feeds. AI has a broad range of applications, including image recognition, natural language processing, medical diagnosis, robotic movements, fraud detection and much more.

Machine learning (ML) is when a machine keeps improving its performance, even after you've stopped programming it. ML is what makes most AI work so well, especially when there is abundant training data. Deep Blue was rule-based: it was AI without machine learning. AlphaGo used machine learning and gained its proficiency by first training on a large dataset of expert moves and then playing additional games against itself to learn what did or didn't work. Since machine learning techniques improve with more data, big data amplifies machine learning. Most AI headlines today, and almost all the AI I'll discuss in this book, are applications of machine learning.

The origins of AI

Researchers have been developing AI methods since the 1950s. Many techniques used today are several decades old, originating in the self-improving algorithms developed in the research labs of MIT's Marvin Minsky and Stanford's John McCarthy.

AI and ML hit several false starts. Researchers had high expectations, but computers were limited and initial results were disappointing. By the early 1970s, what was termed 'the first AI winter' had set in, lasting through the end of the decade. Enthusiasm for AI resurfaced in the 1980s, particularly following industry success with expert systems. The US, UK and Japanese governments invested hundreds of millions of dollars in university and government research labs, while corporations spent similar amounts on in-house AI departments. An industry of hardware and software companies grew to support AI. The AI bubble soon burst again.
The supporting hardware market collapsed, expert systems became too expensive to maintain and extensive investments proved disappointing. In 1987, the US government drastically cut AI funding, and the second AI winter began.

Why the recent resurgence of AI?

AI picked up momentum again in the mid-1990s, partly due to the increasing power of supercomputers. Deep Blue's 1997 chess victory was actually a rematch. It had lost 15 months earlier, after which IBM gave it a major hardware upgrade.12 With twice the processing power, it won the rematch using brute computational force. Although it had used specialized hardware, and although its application was very narrow, Deep Blue had demonstrated the increasing power of AI.

Big data gave an even greater boost to AI with two key developments:

1. We started amassing huge amounts of data that could be used for machine learning.
2. We created software that would allow normal computers to work together with the power of a supercomputer.

Powerful machine learning methods could now run on affordable hardware and could feast on massive amounts of training data. As an indication of scale, ML applications today may run on networks of several hundred thousand machines. One especially well-publicized machine learning technique that is increasingly used today is artificial neural networks, a technique recently extended to larger (deeper) networks and branded as deep learning. This technique contributed to AlphaGo's victory in 2016.

Artificial neural networks and deep learning

Artificial neural networks (ANNs) have been around since the late 1950s. They are collections of very simple building blocks pieced together to form larger networks. Each block performs only a few basic calculations, but the whole network can be 'trained' to assist with complicated tasks: label photos, interpret documents, drive a car, play a game, etc. Figure 2.2 gives examples of ANN architectures.

Artificial neural networks are so named because of their similarity to the connected neurons within the animal brain. They function as pattern recognition tools, similar to the early layers of our mind's visual cortex but not comparable with the parts of our brains that handle cognitive reasoning.

Figure 2.2 Examples of artificial neural network architectures.13

The challenge of building an ANN is in choosing an appropriate network model (architecture) for the basic building blocks and then training the network for the desired task. The trained model is deployed to a computer, a smartphone or even a chip embedded within production equipment. There are an increasing number of tools available to facilitate this process of building, training and deploying ANNs. These include Caffe, developed by the Berkeley Vision Lab; TensorFlow, developed within Google and open-sourced in November 2015; Theano; and others.

'Training' an ANN involves feeding it millions of labelled examples. To train an ANN to recognize animals, for example, I need to show it millions of pictures and label the pictures with the names of the animals they contain. If all goes well, the trained ANN will then be able to tell me which animals appear in new, unlabelled photos. During the training, the network structure itself does not change, but the strength of the various connections between the 'neurons' is adjusted to make the model more accurate. Larger, more complex ANNs can produce models that perform better, but they can take much longer to train.
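To make the notion of 'training' concrete, here is a minimal sketch using the Keras API of TensorFlow (one of the tools mentioned above). The data is randomly generated stand-in data and the network is deliberately tiny; only the workflow is the point — define an architecture, then adjust its connection weights against labelled examples.

```python
# Minimal sketch of defining and training a small neural network with Keras.
# The data here is random stand-in data; only the workflow is being illustrated.
import numpy as np
from tensorflow import keras

# 10,000 "examples" with 20 features each, plus a binary label for each example.
X = np.random.rand(10_000, 20)
y = (X.sum(axis=1) > 10).astype(int)      # a trivially learnable pattern

model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(32, activation="relu"),      # hidden layer of simple "neurons"
    keras.layers.Dense(16, activation="relu"),      # another hidden layer
    keras.layers.Dense(1, activation="sigmoid"),    # output: a probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# "Training" adjusts the connection weights; the architecture itself does not change.
model.fit(X, y, epochs=5, batch_size=64, verbose=0)
print(model.evaluate(X, y, verbose=0))    # [loss, accuracy] on the training data
```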
The layered networks used today are generally much deeper, hence the rebranding of ANNs as 'deep learning'. Using them requires big data technologies.

ANNs can be applied to a broad range of problems. Before the era of big data, researchers would say that neural networks were 'the second-best way to solve any problem'. This has changed. ANNs now provide some of the best solutions. In addition to improving image recognition, language translation and spam filtering, Google has incorporated ANNs into core search functionality with the implementation of RankBrain in 2015. RankBrain, a neural network for search, has proven to be the biggest improvement to ranking quality Google has seen in several years. It has, according to Google, become the third most important of the hundreds of factors that determine search ranking.14

Case study – The world's premier image recognition challenge

The premier challenge in image recognition is the annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC), in which global research teams compete to build machine learning programs to label over 14 million images. An ANN won the challenge for the first time in 2012, and in a very impressive way: whereas the best classification error rate for earlier ML algorithms had been 26 per cent, the ANN had a classification error rate of only 15 per cent. ANNs have won every subsequent competition. In 2014, the GoogLeNet program won with an error rate of only 6.7 per cent, using an ANN with 22 layers and several million artificial neurons. This network was three times deeper than that of the 2012 winner, and, in terms of neuron count compared with animal brains, it was slightly ahead of honey bees but still behind frogs. By 2016, the winning ML algorithm (CUImage) had reduced the classification error to under 3 per cent, using an ensemble of AI methods including an ANN with 269 layers (ten times deeper than the 2014 winner).

How AI helps analyse big data

Most big data is unstructured data, including images, text documents and web logs. We store these in raw form and extract detailed information when needed. Many traditional analytic methods rely on data that is structured into fields such as age, gender, address, etc. To better fit a model, we often create additional data fields, such as average spend per visit or time since last purchase, a process known as feature engineering. Certain AI methods do not require feature selection and are especially useful for data without clearly defined features. For example, an AI method can learn to identify a cat in a photo just by studying photos of cats, without being taught concepts such as cat faces, ears or whiskers.

Some words of caution

Despite early enthusiasm, the AI we have today is still 'narrow AI'. Each is only useful for the specific application for which it was designed and trained. Deep learning has brought marginal improvements to narrow AI, but what is needed for full AI is a substantially different tool set. Gary Marcus, a research psychologist at New York University and co-founder of Geometric Intelligence (later acquired by Uber), describes three fundamental problems with deep learning.15

Figure 2.3 Example of AI failure.16
Figure 2.4 Dog or ostrich?

1. There will always be bizarre results, particularly when there is insufficient training data.
For example, even as AI achieves progressively more astounding accuracy in recognizing images, we continue to see photos tagged with labels bearing no resemblance to the photo, as illustrated in Figure 2.3. Or consider Figure 2.4 above: by modifying the image of a dog in the left-hand image in ways that the human eye cannot detect, researchers fooled the best AI program of 2012 into thinking the image on the right was an ostrich.17

2. It is very difficult to engineer deep learning processes. They are difficult to debug, revise incrementally and verify.

3. There is no real progress in language understanding or causal reasoning. A program can identify a man and a car in a photo, but it won't wonder, 'Hey, how is that man holding the car above his head?'

Keep in mind

Artificial intelligence is still limited to addressing specific tasks with clear goals. Each application needs a specially designed and trained AI program. Remember also that AI is very dependent on large quantities of diverse, labelled data and that AI trained with insufficient data will make more mistakes.

We've already seen self-driving cars make critical errors when navigating unusual (i.e. untrained) conditions. Our tolerance for inaccuracy in such applications is extremely low.

AI often requires a value system. A self-driving car must know that running over people is worse than running off the road. Commercial systems must balance revenue and risk reduction with customer satisfaction.

Applications of AI in medicine bring their own promises and pitfalls. A team at Imperial College London has recently developed AI that diagnoses pulmonary hypertension with 80 per cent accuracy, significantly higher than the 60 per cent accuracy typical among cardiologists. Application of such technology, though, brings before us some complicated issues, as we'll discuss later.18

AI applications have captured headlines over the past few years, and they will doubtless continue to do so. I'll talk more about AI in Chapter 8, when I discuss choosing analytic models that fit your business challenges. But AI is just one of many analytic tools in our tool chest, and its scope is still limited. Let's step back now and consider the bigger picture of how big data can bring value through a wider set of tools and in a broad range of applications.

Takeaways

AI has been actively studied for 60 years but has twice gone through winters of disillusionment.
Much of AI involves machine learning, where the program self-learns from examples rather than simply following explicit instructions. Big data is a natural catalyst for machine learning.
Deep learning, a modern enhancement of an older method known as neural networks, is used in much of today's AI technology.
AI programs are limited in scope and will always make non-intuitive errors.

Ask yourself

Where in your organization do you have large amounts of labelled data that could be used for training a machine learning program, for example to recognize patterns in pictures or text, or to predict the next customer action based on previous actions?
If you've already started an AI project in your organization, how much greater is its estimated return on investment (ROI) than its cost? If you multiply the estimated chance of success by the estimated ROI, you should get a number exceeding the estimated cost.

Chapter 3: Why is big data useful?

'Big data is why Amazon's recommendations work so well. Big data is what tunes search and helps us find what we need.
Big data is what makes web and mobile intelligent.'
—Greg Linden, pioneering data scientist at Amazon19

The big data ecosystem fundamentally changes what you can do with data, and it fundamentally changes how you should think about data.

Completely new ways to use data

We are doing things today that we could not have done without big data technologies. Some of these applications are recreational, while some are foundational to our understanding of science and healthcare. Big data was what enabled scientists to collect and analyse the massive amounts of data that led to the discovery of the Higgs boson at Europe's enormous CERN research facility in 2012. It is allowing astronomers to operate telescopes of unprecedented size. It has brought cancer research forward by decades.20 The quantity of training data and the technologies developed to process big data have together breathed new life into the field of artificial intelligence, enabling computers to win at Jeopardy (IBM's Watson computer), master very complicated games (DeepMind's AlphaGo) and recognize human speech better than professional transcriptionists (Microsoft Research).21

The ability of search engines to return relevant results from millions of sources relies on big data tools. Even the ability of mid-sized e-commerce sites to return relevant results from their own inventories relies on big data tools such as Solr or Elasticsearch. Data and analytics were extremely useful before the recent explosion of data, and 'small data' will continue to be valuable. But some problems can only be solved using big data tools, and many can be solved better using big data tools.

A new way of thinking about data

Big data changes your data paradigm. Instead of rationing storage and discarding potentially valuable data, you retain all data and promote its use. By storing raw data in data lakes, you keep all options open for future questions and applications.

Consider a simple illustration. Suppose I develop an interest in Tesla cars and decide to count the Teslas I see for one month. After the month, I have a number. But if someone asks me for details about colour, time of day, or perhaps another type of vehicle, I'll need another month before I can give an answer. If I had instead kept a video camera on my car during the first month and had saved all my recordings, I could answer any new questions with data I already had.

Following a data-driven approach

W. Edwards Deming, the American engineer who worked to re-invigorate Japanese industry in the 1950s, is often credited with the quote, 'In God we trust; all others bring data.' Whereas some organizations are led by the intuition of their leaders or diligently adhere to established practices, data-driven organizations prioritize data in making decisions and measuring success. Such a data-driven approach was instrumental in Bill Bratton's leadership of the NYPD during the 1990s, when he introduced the CompStat system to help reduce crime in New York City.22 In practice, we all operate using a blend of intuition, habit and data, but if you follow a data-driven approach, you will back up your intuition with data and actively develop the tools and talent required to analyse your data.

Data insights

Challenge your assumptions and ask for supporting data. For example, find data showing whether your regular promotions are boosting revenue or are simply loss-makers. Track how customer segments respond to different product placements. Find out why they do or don't come back.
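As a small, entirely invented illustration of that first question, the sketch below compares customers who received a promotion with a hold-out (control) group; the data, column names and numbers are made up purely to show the shape of the analysis.

```python
# A minimal sketch of checking whether a promotion actually boosts revenue:
# compare customers who saw the promotion against a hold-out (control) group.
# The numbers and column names are invented for illustration.
import pandas as pd

orders = pd.DataFrame({
    "customer_group": ["promo", "promo", "promo", "control", "control", "control"],
    "revenue":        [54.0,    61.0,    38.0,    52.0,      49.0,      47.0],
    "discount_cost":  [5.0,     5.0,     5.0,     0.0,       0.0,       0.0],
})

summary = orders.groupby("customer_group").agg(
    avg_revenue=("revenue", "mean"),
    avg_discount=("discount_cost", "mean"),
)
uplift = summary.loc["promo", "avg_revenue"] - summary.loc["control", "avg_revenue"]
net_effect = uplift - summary.loc["promo", "avg_discount"]

print(summary)
print(f"Revenue uplift per customer: {uplift:.2f}; net of discount cost: {net_effect:.2f}")
```

In this toy example the promotion lifts revenue slightly but costs more in discounts than it earns — exactly the kind of loss-maker that only shows up when you ask the data.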
Case study – Tesco's Clubcard

Some organizations are natively data-driven. Others undergo a data transformation. British supermarket giant Tesco is an example of the latter. With the help of external analysts, Tesco experienced tremendous success adopting a data-driven approach to customer relations and marketing, fuelled by the data from their Tesco Clubcard. The chairman, Ian MacLaurin, amazed at the analysts' insights, said, 'You know more about my customers in three months than I know in 30 years.' This period of data-driven growth brought Tesco's market share from 18 per cent in 1994 to 25 per cent in 2000, as shown in Figure 3.1.23, 24 Its management would later say that data had guided nearly all key business decisions during that time, reducing the risk of launching bold initiatives and providing an extremely clear sense of direction in decision making.

Figure 3.1 Tesco share price (Clubcard launched Q1 1995).25

Analysis

Some insights jump out from data. Others you'll have to dig for, perhaps using statistical methods for forecasts or correlations. Our next case study illustrates such a process.

Case study – Target's marketing to expecting mothers

In 2002, when big data technology was still incubating in Silicon Valley, Target Corporation was initiating a data-driven effort that would bring it significant revenue, along with a certain amount of unwelcome publicity. Target, the second largest discount retailer in the United States, was struggling to gain market share from Walmart. Target had a brilliant idea, an idea that would require creative use of data.

Professor Alan Andreasen had published a paper in the 1980s demonstrating that buying habits are more likely to change at major life events. For Target, the customer event with perhaps the greatest spending impact would be the birth of a child. Target launched a project to flag pregnant shoppers based on recent purchases, with the goal of marketing baby products to these shoppers at well-timed points in the pregnancy. Target's analysts carefully studied all available data, including sales records, birth registries and third-party information. Within a few months, they had developed statistical models that could identify pregnant shoppers with high accuracy based solely on what products they were purchasing, even pinpointing their due dates to within a small window.
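Target has never published the details of its model, but the general shape of such a purchase-based propensity model can be sketched in a few lines of scikit-learn. Everything below — the product signals, the data and the labels — is invented purely for illustration, not Target's actual method.

```python
# Schematic sketch of a purchase-based propensity model (illustrative only;
# the features, data and labels are invented, not Target's actual model).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: does a shopper's recent basket contain these (hypothetical) signals?
# Columns: [unscented lotion, large handbag, calcium supplements, cotton balls]
X = np.array([
    [1, 1, 1, 1],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])
y = np.array([1, 1, 0, 0, 1, 0])   # 1 = known to be expecting (e.g. via a baby registry)

model = LogisticRegression().fit(X, y)

new_shopper = np.array([[1, 0, 1, 1]])
print(model.predict_proba(new_shopper)[0, 1])   # estimated propensity score
```

The real work lies not in these few lines but in assembling reliable labels and purchase features at scale — which is where the data, rather than the model, carries most of the value.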
Such data-driven approaches are still bringing success to many companies today. What’s changed is that you now have access to new types of data and better tooling. Better data tooling Data brings insights. Your ability to archive and analyse so much potentially relevant data lets you find answers quickly and become extremely agile in your business planning. It’s disruptive technology. More data generally enables better analysis. It improves some analysis and completely transforms others. It’s like adding power tools to a set of hand tools. Some jobs you can do better, and some that were previously not feasible suddenly become feasible. In this next section, I’ll show some ways big data makes traditional analytics better. Data: the more the better You’ll want to collect as much data as possible to do your analysis: more types of data and greater quantities of each type. There is a fundamental principle of analytics that ‘more data beats better models’. The strength of your analysis depends on: 1. Discovering what data is most meaningful. 39 2. Selecting an analytic tool appropriate to the task. 3. Having enough data to make the analysis work. The reason you’ll want to develop big data capabilities is that big data gives you additional types of data (such as customer journey data) for the first dependency and additional quantities of data for the third dependency. Keep in mind Update your existing statistical and analytic models to incorporate new data sources, particularly big data such as web traffic, social media, customer support logs, audio and video recordings and various sensor data. Additional types of data To illustrate, imagine an insurer calculating the premium for your car insurance. If the insurer knows only your home address, age and car model, they can make a rough estimate of your risk level. Telling the insurer how far you drive each year would give more insight, as more driving means more risk. Telling where and when you drive would give even more insight into your risk. The insurance company will benefit more from getting the additional data than it would from improving its risk model with the original, limited data. In a similar way, big data provides additional types of data. It gives detailed sensor information to track product performance for machines. It allows us to record and analyse deceleration rates for cars equipped with monitoring devices. It allows us to manage massive volumes of audio and video data, social media activity and online customer journey data. The value of customer journey data Customer journey data is an extremely valuable type of big data. Tesco’s customer analysis in the late 1990s used demographic information (age, gender, family profile, address) and purchase data. This was a lot of data at the time, considering their limited storage media, and it was sufficient for insights into purchase patterns of customer segments. The resulting insights were valuable for marketing, product selection and pricing, but they gave a two-dimensional view of a three-dimensional world. Tesco only saw what happened when the customer reached the checkout queue. The data we have today is much richer. 40 Although traditional web analytics gives you a two-dimensional view, with summary statistics such as traffic volume and conversion events (e.g. purchases), the complete web logs (the big data) will tell you: What marketing effort sent each customer to your site: Facebook, Google, an email campaign, a paid advertisement? 
What was top-of-mind when each customer entered? You might see the link that brought them or the first search term used onsite. What is most important to each customer? You’ll see which filters the customer (de-) selects and the sort order chosen (increasing or decreasing price, rating, etc.). Knowing this can make a significant difference in how you approach each customer during the rest of their online visit. What alternate products each customer considered before making a purchase. With the online customer journey, you can analyse micro-conversions that signal when you’ve captured the customer’s interest. Particularly for expensive items with infrequent sales, you’ll want to understand how items are capturing the interest of your visitors, and you’ll use these insights in deciding how to sell to future visitors. How to create successful shopping experiences, based on what you learn about customer intention and preference. For example, you might learn that, for customers who entered your site looking for an android tablet, filtered for memory above 64GB, sorted based on decreasing price and then sorted by highest product review, the most commonly bought tablets were XXX and that certain other tablets were never purchased. You’ll see what additional items this type of customer often purchased. Using this knowledge, you can guide look-alike customers to quickly find the item or items that best suit them. If you ran a small shop and were on a first-name basis with each customer, you would already have such insights and would rely on them to improve your business. In e-commerce, with millions of unseen customers, recapturing this level of insight is extraordinary. We are not talking about invasive spying techniques. You can get valuable insights from studying even anonymous online customer journeys. Your stores of big data allow you to ask new questions from old data. When you notice a sales spike over the past quarter and wonder how this related to a certain popular item, you can search through detailed historic data to see which customer visits included searches or views of that item. This flexibility in after-the-fact analysis is only possible with big data solutions. In statistical analysis, as in the Target example, the customer journey data will provide new features for your analytic models. In the past, your models used customer age, income and location, but you can now add search terms and filters, search result orderings and item views. Knowing that a customer bought an 41 unscented hand cream was a signal of possible pregnancy for Target. Knowing that the customer specifically searched for hand cream that is unscented would have been an even stronger signal. Keep in mind If your website sees significant customer engagement, you should start using a big data system to store and analyse the detailed online activity. You’ll benefit from this analysis even if the visitors remain anonymous. Your detailed customer journey logs will accumulate at a rate of several gigabytes or even terabytes of unstructured data per day. You won’t use your traditional databases for this. We’ll talk more about selecting appropriate databases in Chapter 8. Additional quantities of data Some analytic models require very little data to work properly. (You need just two points to fit a line.) But many models, especially machine learning models, work much better as they are fed more data. 
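A quick way to check whether a particular model is still data-hungry is a learning curve: train it on progressively larger samples and watch whether hold-out accuracy keeps rising. Below is a minimal sketch of the idea, assuming scikit-learn is available and using synthetic data in place of your own features and labels:

# Minimal sketch: a learning curve shows whether hold-out accuracy is still
# improving as the training set grows. Synthetic data stands in for your own
# feature matrix and labels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n} training examples -> mean validation accuracy {score:.3f}")
# If the curve is still rising at the largest size, gathering more data is
# likely to help more than further tuning of the model itself.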
Michele Banko and Eric Brill, researchers at Microsoft in 2001, demonstrated how certain machine learning methods never stopped benefitting from more data, even as they were gorged with extreme amounts.27 Such machine learning algorithms truly benefit from big data. The examples above focused heavily on retail applications. I’ll round out the chapter with a case study from medical research. Case study – Cancer research Big data is playing an increasingly important role in cancer research, both for storing and for analysing important genomic data. There are numerous applications, but I’ll briefly mention two: genomic storage and pathway analysis Every cancer is different, even for patients with the same type of cancer. A single tumour mass may have 100 billion cells, each mutating in a different way, so that studying only a sample of tumour cells will not give the complete picture of what is happening in that individual. Technology is making it possible for cancer researchers to record the data from more and more of those cancer cells. Since 2003, with the completion of the Human Genome Project, the cost of sequencing genomes has dropped dramatically, as shown in Figure 3.2. 42 The result is that we are building up a huge catalogue of genomic data, particularly related to cancer. Estimates are that scientists will soon be sequencing and storing more than an exabyte of genomic data every year. Big data technologies are also providing the tools for studying that data. Cancers are characterized by how they disrupt cell protein pathways, and these disruptions differ from patient to patient. To gain deeper insight into these patterns, researchers have developed a method where gene interaction networks are modelled as graphs of 25 thousand vertices and 625 million edges. Protein pathways then correspond to subnetworks in this graph. Researchers can identify connected subnetworks mutated in a significant number of patients using graph algorithms running on big data technologies (such as Flink). Such methods have already brought insights into ovarian cancer, acute myeloid leukaemia and breast cancer. Figure 3.2 Historic cost of sequencing a single human genome.28, 29 But not all applications of big data methods to cancer research have been successful, as we’ll see in a case study in Chapter 12. Takeaways Big data technologies enable you to bring business value from otherwise unmanageable data. Big data technologies allow you to operate in a much more data-driven manner. Big data opens the door to new analytic methods and makes traditional methods more accurate and insightful. Online customer journey is an example of big data that has proven valuable in many applications. 43 Big data has many applications to medical research Ask yourself When was the last time you uncovered an unexpected insight within your data? Do you have people and processes in place to promote data-driven insights? Which analytic techniques currently used within your organization could be improved by incorporating new data sources not available when those techniques were first built? What problems have you previously written off as ‘too difficult to solve’ because you didn’t have the necessary data or computing power? Which of these might you now be able to solve with big data technologies? 
44 Chapter 4 45 Use cases for (big) data analytics In this chapter, I’ll cover important business applications of analytics, highlighting the enhancements of big data technologies, either by providing scalable computing power or through the data itself. It is not uncommon for these applications to raise KPIs by double digits. A/B testing In A/B testing, also called split testing, we test the impact of (typically small) product modifications. We divide customers into random groups and show each a different version. We run the test for a few weeks and then study the impact. Any attribute of your website can be tested in this way: arrangements, colours, fonts, picture sizes, etc. Companies run hundreds of A/B tests over the course of a year to find what best impacts total sales, bounce rate, conversion path length, etc. A/B testing is the life blood of online companies, allowing them to quickly and easily test ideas and ‘fail fast’, discarding what doesn’t work and finding what does. Beyond simply observing customer behaviour, A/B testing lets you take an active role in creating data and making causal statements. You’re not simply watching customers, you’re creating new digital products and seeing how the customers react. A/B testing can boost revenue by millions of dollars. A/B testing is not in itself a big data challenge, but coupling A/B testing with a big data application makes it much more effective. There are several reasons for this. By eliminating sampling, big data allows you to perform deep dives into your target KPIs, exploring results within very specific segments. To illustrate with a simplistic example, if you run an A/B test in Europe where the A variant is English text and the B variant is German text, the A variant would probably do better. When you dive deeper, splitting the results by visitor country, you get a truer picture. If you run an e-commerce platform with several hundred product categories and customers in several dozen countries, the variants of your A/B test will perform very differently by category and location. If you only study the summary test data, or if you only keep a small percentage of the test data (as was the standard practice in many companies), you’ll lose the quantity of data you would need to draw a meaningful conclusion when a product manager asks you about the performance of a certain product in a certain 46 market within a specific time window (for example, when a high-priced marketing campaign was run during a major network event). It is big data that gives you these valuable, detailed insights from A/B tests. The second way in which big data improves A/B testing is that, by allowing you to keep all the customer journey data for each testing session, it allows you to go beyond KPIs and begin asking nuanced questions regarding how test variants impacted customer journey. Once you have added a test variant ID to the big data customer journey storage, you can then ask questions such as ‘which variant had the shorter average length of path to purchase?’ or ‘in which variant did the customer purchase the most expensive product viewed?’ These detailed questions would not be possible in standard A/B implementations without big data. The third way, which we touched on in the last chapter, is that big data lets you answer new questions using data that you’ve already collected. Conjectures about user responses to product changes can sometimes be answered by looking to vast stores of historical data rather than by running new tests. 
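As a minimal sketch of such a look-back, suppose you can pull sessions from your journey store with a flag for whether a certain feature was used and whether the visit ended in a purchase. The tiny table and column names below are invented purely for illustration:

# Minimal sketch: answer "do sessions that use the size filter convert better?"
# from data you have already collected, instead of running a new test.
# In practice this table would be an export from your journey store.
import pandas as pd

sessions = pd.DataFrame({
    "used_size_filter": [True, True, False, False, True, False],
    "purchased":        [1,    0,    0,     0,     1,    1],
})

lookback = (sessions.groupby("used_size_filter")["purchased"]
            .agg(conversion_rate="mean", sessions="count"))
print(lookback)
# A large, consistent gap here answers the question immediately - or tells you
# whether a formal A/B test is worth the wait.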
To illustrate, imagine a company such as eBay is trying to understand how additional item photos might boost sales. They could test this by listing identical products for sale, differing only in the number of photos, and running this experiment for several weeks. If they instead used a big data system, they could immediately comb through the historical data and identify such pairs of products which had already been sold. Power sellers on a site such as eBay would have already run such selling experiments for their own benefit. eBay need only find these user-run experiments already stored in the big data storage system. In this way, the company gets immediate answers to their question without waiting for new test results. Recommendation engines/next best offer Recommendation engines have proven their worth for many companies. Netflix is the poster child for recommendation engines, having grown user base and engagement metrics not only by acquiring and producing video content, but also through personalized recommendations. In e-commerce, a key tactical capability is to recommend the products at the appropriate moments in a manner that balances a set of sometimes conflicting goals: customer satisfaction, maximum revenue, inventory management, future sales, etc. You must assess which product would most appeal to each customer, balanced against your own business goals, and you must present the product to the customer in a manner most likely to result in a purchase. 47 If you’re a publisher, you are also facing the challenge of recommending articles to your readers, making choices related to content, title, graphics and positioning of articles. Even starting with a specific market segment and category (world news, local news, gardening, property etc.), you need to determine the content and format that will most appeal to your readers. Case study – Predicting news popularity at The Washington Post30 The Washington Post is one of the few news agencies that have excelled in their creation of an online platform. Acquired by Amazon founder Jeff Bezos in 2013, it’s no surprise it has become innovative and data-driven. In fact, Digiday called The Post the most innovative publisher of 2015. By 2016, nearly 100 million global readers were accessing online content each month. The Post publishes approximately 1000 articles each day. With print, publishers choose content and layout before going to press and have very limited feedback into what works well. Online publishing provides new insights, allowing them to measure readers’ interactions with content in real time and respond by immediately updating, modifying or repositioning content. The millions of daily online visits The Post receives generate hundreds of millions of online interactions, which can immediately be used to steer publishing and advertising. The Post is also using this big data to predict article popularity, allowing editors to promote the most promising articles and enhance quality by adding links and supporting content. Importantly, they can monetize those articles more effectively. If the model predicts an article will not be popular, editors can modify headlines and images to increase success metrics such as views and social shares. The Post’s data-driven culture is paying off. 
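To make that balancing act concrete, here is a minimal sketch of a blended 'next best offer' score; the weights, normalizations and product names are purely illustrative and would need tuning against your own KPIs:

# Minimal sketch: blend predicted appeal with business goals into one score.
# All weights and inputs below are hypothetical.
def offer_score(p_purchase, unit_margin, weeks_of_stock,
                w_appeal=0.6, w_margin=0.3, w_inventory=0.1):
    margin_component = min(unit_margin / 100.0, 1.0)        # normalize margin to [0, 1]
    overstock_component = min(weeks_of_stock / 26.0, 1.0)   # favour slow-moving stock
    return (w_appeal * p_purchase
            + w_margin * margin_component
            + w_inventory * overstock_component)

# Rank candidate products for one customer: (name, p_purchase, margin, weeks of stock)
candidates = [("tablet_a", 0.12, 80, 4), ("tablet_b", 0.09, 120, 30)]
ranked = sorted(candidates, key=lambda c: offer_score(*c[1:]), reverse=True)
print("Best offer:", ranked[0][0])

Notice that the overstocked, higher-margin item can outrank the one the customer is marginally more likely to buy; shifting the weights shifts that balance.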
In an age where traditional publishers are struggling to reinvent themselves, The Post recently reported a 46 per cent annual increase in online visitors and a 145 per cent increase in annual digital-only subscriptions.31 We see how the move to online provided insight when articles were being read and shared. The publisher could see which articles were clicked (thus demonstrating the power of the headline and photo), which were read to the end (based on scrolls and time on page) and which were shared on social media. This digital feedback enabled a feedback loop not possible in print. However, using the digital feedback effectively requires the publishers to turn to digital data solutions. As the data grows, accelerates and becomes more complex, the publisher needs advanced tools and techniques for digital insights. To illustrate, consider a publisher who knows the number of readers of certain articles, but wants to understand the sentiment of the readers. This publisher might start 48 collecting and analysing text data from mentions of articles on social media, using sentiment analysis and more complex AI techniques to understand an article’s reception and impact. For merchants, recommending grew more difficult. Placing an item for sale online made it easy to sell, but customers became faceless and often anonymous. As a merchant, you need to know what products customers are most likely to buy, and you need to know how to help them. Both require a continuous feedback cycle which is responsive to each question and action of the customer. When the customer enters the store, you form a sales strategy from first impressions. A young girl will likely buy different items than an older man. The first question from the customer will indicate their intention, and their response to the first items they see will give insights into their preferences. Recommendation engines typically use a blend of two methodologies. The first, called collaborative filtering, contributes a recommendation score based on past activity. The second, content-based filtering, contributes a score based on properties of the product. As an example, after I’ve watched Star Wars Episode IV, collaborative filtering would suggest Star Wars Episode V, since people who liked Episode IV typically like Episode V. Content-based filtering, however, would recommend Episode V because it has many features in common with Episode IV (producer, actors, genre, etc.). An unwatched, newly released movie would not be recommended by the collaborative algorithm but might be by the content-based algorithm. Big data is what makes recommendation engines work well. If you’re building a recommendation engine, you’ll want to calibrate it using abundant, detailed data, including browsing data, and this is provided by your big data stores. The big data ecosystem also provides you with the scalable computing power to run the machine learning algorithms behind your recommendation engines, whether they are crunching the numbers in daily batch jobs or performing real-time updates. Your recommendation engine will work best when it can analyse and respond to real-time user behaviour. This ability, at scale, is what the big data ecosystem provides. Your customers are continuously expressing preferences as they type search terms and subsequently select or ignore the results. The best solution is one that learns from these actions in real time. 
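The collaborative-filtering idea described above can be sketched in a few lines: two items count as similar when the same users have interacted with both. The tiny interaction matrix below is invented for illustration; in practice it would be built from your big data store and refreshed continuously:

# Minimal sketch of item-item collaborative filtering on an invented matrix.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# rows = users, columns = items (1 = watched/bought, 0 = not)
interactions = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
])

item_similarity = cosine_similarity(interactions.T)

# Score unseen items for user 0 by similarity to what they already consumed.
user = interactions[0]
scores = item_similarity @ user
scores[user == 1] = -1          # don't re-recommend items already consumed
print("Recommend item", int(np.argmax(scores)))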
Forecasting: demand and revenue If your forecasting model was built without incorporating big data, it probably is a statistical model constructed from a few standard variables and calibrated using basic historic data. You may have built it using features such as geography, date, 49 trends and economic indicators. You may even be using weather forecasts if you are forecasting short-term demand and resulting revenue. Big data can sharpen your forecasting in a couple of ways. First, it gives you more tools for forecasting. You can keep using your standard statistical models, and you can also experiment using a neural network trained on a cluster of cloud-based graphical processing units (GPUs) and calibrated using all available data, not just a few pre-selected explanatory variables. Retailers are already using such a method to effectively forecast orders down to the item level. Second, big data will provide you with additional explanatory variables for feature engineering in your current forecasting models. For example, in addition to standard features such as date, geography, etc., you can incorporate features derived from big data stores. A basic example would be sales of large ticket items, where increasingly frequent product views would be a strong predictor of an impending sale. IT cost savings You can save significant IT costs by moving from proprietary technology to open-source big data technology for your enterprise storage needs. Open-source technologies run on commodity hardware can be 20–30 times cheaper per terabyte than traditional data warehouses.32 In many cases, expensive software licenses can be replaced by adopting open-source technologies. Be aware, though, that you’ll also need to consider the people cost involved with any migration. Marketing Marketing is one of the first places you should look for applying big data. In Dell’s 2015 survey,1 the top three big data use cases among respondents were all related to marketing. These three were: Better targeting of marketing efforts. Optimization of ad spending. Optimization of social media marketing. This highlights how important big data is for marketing. Consider the number of potential ad positions in the digital space. It’s enormous, as is the number of ways that you can compose (via keyword selection), purchase (typically through some bidding process) and place your digital advertisements. Once your 50 advertisements are placed, you’ll collect details of the ad placements and the click responses (often by placing invisible pixels on the web pages, collectively sending millions of messages back to a central repository). Once customers are engaged with your product, typically by visiting your website or interacting with your mobile application, they start to leave digital trails, which you can digest with traditional web analytics tools or analyse in full detail with a big data tool. Marketing professionals are traditionally some of the heaviest users of web analytics, which in turn is one of the first points of entry for online companies that choose to store and analyse full customer journey data rather than summarized or sampled web analytics data. Marketing professionals are dependent on the online data to understand the behaviour of customer cohorts brought from various marketing campaigns or keyword searches, to allocate revenue back to various acquisition sources, and to identify the points of the online journey at which customers are prone to drop out of the funnel and abandon the purchase process. 
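As a minimal sketch of that funnel analysis, suppose you can pull event-level journey data into a table of session identifiers and event names; the sample and the event names below are invented:

# Minimal sketch: find the funnel step where sessions drop out, from event-level
# journey data. In practice this would come from your full, unsampled clickstream.
import pandas as pd

events = pd.DataFrame({
    "session_id": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "event": ["view_product", "add_to_cart", "start_checkout",
              "view_product", "add_to_cart",
              "view_product", "add_to_cart", "start_checkout", "purchase"],
})
funnel = ["view_product", "add_to_cart", "start_checkout", "purchase"]

reached = {step: events.loc[events["event"] == step, "session_id"].nunique()
           for step in funnel}
for prev, nxt in zip(funnel, funnel[1:]):
    rate = reached[nxt] / reached[prev] if reached[prev] else 0.0
    print(f"{prev} -> {nxt}: {rate:.0%} of sessions continue")
# The step with the weakest continuation rate is where testing and UX work
# will pay off first.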
Social media Social media channels can play an important role in helping you understand customers, particularly in real time. Consider a recent comScore report showing that social networking accounts for nearly one out of five minutes spent online in the US (see Figure 4.1) Figure 4.1 Share of the total digital time spent by content category. 51 Source: comScore Media Metrix Multi-Platform, US, Total Audience, December 2015.33 Social media gives insight into customer sentiment, keyword usage and campaign effectiveness, and can flag a PR crisis you need to address immediately. Social media data is huge and it moves fast. Consider Twitter, where 6000 tweets are created each second, totalling 200 billion tweets per year.34 You’ll want to consider a range of social channels, as each may play an important role in understanding your customer base, and each has its own mixture of images, links, tags and free text, appealing to slightly different customer segments and enabling different uses. Pricing You may be using one or more standard pricing methods in your organization. These methods are specialized to fit specific sectors and applications. Financial instruments are priced to prevent arbitrage, using formulas or simulations constructed from an underlying mathematical model of market rate movements. Insurance companies use risk- and cost-based models, which may also involve simulations to estimate the impact of unusual events. If you are employing such a simulation-based pricing method, the big data ecosystem provides you with a scalable infrastructure for fast Monte Carlo simulations (albeit with issues related to capturing correlations). If you are in commerce or travel, you may be using methods of dynamic pricing that involve modelling both the supply and the demand curves and then using experimental methods to model price elasticity over those two curves. In this case, big data provides you with the forecasting tools and methods mentioned earlier in this chapter, and you can use the micro-conversions in your customer journey data as additional input for understanding price elasticity. Customer retention/customer loyalty Use big data technologies to build customer loyalty in two ways. First, play defence by monitoring and responding to signals in social media and detecting warning signals based on multiple touch points in the omni-channel experience. I’ll illustrate such an omni-channel signal in the coming section on customer churn. In Chapter 6, I’ll also discuss an example of customer service initiated by video analysis, which is a specific technique for applying nontraditional data and AI to retain customers and build loyalty. 52 Second, play offense by optimizing and personalizing the customer experience you provide. Improve your product using A/B testing; build a recommendation engine to enable successful shopping experiences; and deliver customized content for each customer visit (constructed first using offline big data analytics and then implemented using streaming processing for real-time customization). Cart abandonment (real time) Roughly 75 per cent of online shopping carts are abandoned.35 Deploy an AI program that analyses customer behaviour leading up to the point of adding items to shopping carts. When the AI predicts that the customer is likely to not complete the purchase, it should initiate appropriate action to improve the likelihood of purchase. 
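A minimal sketch of such a prediction step is shown below, using a few hypothetical session features and an off-the-shelf classifier; a real system would be trained on your historical journeys and would score live sessions on your streaming platform:

# Minimal sketch: score the risk that a session with items in the cart will not
# end in a purchase. The features and the tiny training sample are invented.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical features: [minutes_on_site, items_in_cart, cart_value, prior_purchases]
X_train = np.array([[3, 1, 20, 0], [15, 3, 90, 2], [5, 2, 45, 0], [25, 4, 150, 5]])
y_train = np.array([0, 1, 0, 1])          # 1 = completed purchase

model = GradientBoostingClassifier().fit(X_train, y_train)

live_session = np.array([[4, 2, 60, 0]])
p_purchase = model.predict_proba(live_session)[0, 1]
if p_purchase < 0.5:
    print("High abandonment risk - trigger an intervention (e.g. assisted chat or a reminder).")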
Conversion rate optimization Conversion rate optimization (CRO) is the process of presenting your product in a way that maximizes the number of conversions. CRO is a very broad topic and requires a multi-disciplinary approach. It is a mixture of art and science, of psychology and technology. From the technology side, CRO is aided by A/B testing, by relevant recommendations and pricing, by real-time product customization, by cart abandonment technologies, etc. Product customization (real time) Adjust the content and format of your website in real time based on what you’ve learned about the visitor and on the visitor’s most recent actions. You’ll know general properties of the visitor from past interactions, but you’ll know what they are looking for today based on the past few minutes or seconds. You’ll need an unsampled customer journey to build your customization algorithms and you’ll need streaming data technologies to implement the solution in real time. Retargeting (real time) Deploy an AI program to analyse the customer behaviour on your website in real time and estimate the probability the customer will convert during their next visit. Use this information to bid on retargeting slots on other sites that the customer subsequently visits. You should adjust your bidding prices immediately (a fraction of a second) rather than in nightly batches. 53 Fraud detection (real time) In addition to your standard approach to fraud detection using manual screening or automated rules-based methods, explore alternative machine learning methods trained on large data sets.36 The ability to store massive quantities of time series data provides both a richer training set as well as additional possibilities for features and scalable, real-time deployment using fast data methods (Chapter 5). Churn reduction You should be actively identifying customers at high risk of becoming disengaged from your product and then work to keep them with you. If you have a paid usage model, you’ll focus on customers at risk of cancelling a subscription or disengaging from paid usage. Since the cost of acquiring new customers can be quite high, the return on investment (ROI) on churn reduction can be significant. There are several analytic models typically used for churn analysis. Some models will estimate the survival rate (longevity) of your customer, while others are designed to produce an estimated likelihood of churn over a period (e.g. the next two months). Churn is typically a rare event, which makes it more difficult for you to calibrate the accuracy of your model and balance between false positives and false negatives. Carefully consider your tolerance for error in either direction, balancing the cost of labelling a customer as a churn potential and wasting money on mitigation efforts vs the cost of not flagging a customer truly at risk of churning and eventually losing the customer. These traditional churn models take as input all relevant and available features, including subscription data, billing history, and usage patterns. As you increase your data supply, adding customer journey data such as viewings of the Terms and Conditions webpage, online chats with customer support, records of phone calls to customer support, and email exchanges, you can construct a more complete picture of the state of the customer, particularly when you view these events as a sequence (e.g. receipt of a high bill, followed by contact with customer support, followed by viewing cancellation policy online). 
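As a minimal sketch, a traditional churn classifier enriched with such journey-derived flags might look like the following; the feature names and sample values are hypothetical:

# Minimal sketch: a churn classifier using journey-derived flags such as the
# high-bill -> support-contact -> cancellation-page sequence described above.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

customers = pd.DataFrame({
    "tenure_months":            [3, 26, 14, 2, 48, 7],
    "support_contacts_90d":     [2,  0,  1, 3,  0, 2],
    "viewed_cancellation_page": [1,  0,  0, 1,  0, 1],
    "high_bill_then_support":   [1,  0,  0, 1,  0, 0],
    "churned_next_60d":         [1,  0,  0, 1,  0, 1],
})
features = ["tenure_months", "support_contacts_90d",
            "viewed_cancellation_page", "high_bill_then_support"]

# class_weight='balanced' helps because churn is typically a rare event.
model = RandomForestClassifier(class_weight="balanced", random_state=0)
model.fit(customers[features], customers["churned_next_60d"])

# Rank customers by churn risk so retention spending goes to the riskiest first.
customers["churn_risk"] = model.predict_proba(customers[features])[:, 1]
print(customers.sort_values("churn_risk", ascending=False).head())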
In addition to utilizing the additional data and data sources to improve the execution of the traditional models, consider using artificial intelligence models, particularly deep learning, to reduce churn. With deep learning models, you can work from unstructured data sources rather than focusing on pre-selecting features for the churn model. 54 Predictive maintenance If your organization spends significant resources monitoring and repairing machinery, you’ll want to utilize big data technologies to help with predictive maintenance, both to minimize wear and to avoid unexpected breakdowns. This is an important area for many industries, including logistics, utilities, manufacturing and agriculture, and, for many of them, accurately predicting upcoming machine failures can bring enormous savings. In some airlines, for example, maintenance issues have been estimated to cause approximately half of all technical flight delays. In such cases, gains from predictive maintenance can save tens of millions annually, while providing a strong boost to customer satisfaction. The Internet of Things (IoT) typically plays a strong role in such applications. As you deploy more sensors and feedback mechanisms within machine parts and systems, you gain access to a richer stream of real-time operational data. Use this not only to ensure reliability but also for tuning system parameters to improve productivity and extend component life. This streaming big data moves you from model-driven predictive maintenance to data-driven predictive maintenance, in which you continuously respond to realtime data. Whereas previously we may have predicted, detected and diagnosed failures according to a standard schedule, supplemented with whatever data was periodically collected, you should increasingly monitor systems in real time and adjust any task or parameter that might improve the overall efficiency of the system. Supply chain management If you’re managing a supply chain, you’ve probably seen the amount of relevant data growing enormously over the past few years. Over half of respondents in a recent survey of supply chain industry leaders37 indicated they already had or expected to have a petabyte of data within a single database. Supply chain data has become much broader than simply inventory, routes and destinations. It now includes detailed, near-continuous inventory tracking technology at the level of transport, container and individual items, in addition to real-time environmental data from sensors within transports. These same survey respondents indicated that the increased visibility into the movements of the supply chain was their most valuable application of big data technology, followed by an increased ability to trace the location of products. These were followed by the ability to harvest user sentiment from blogs, ratings, reviews and social media. Additional applications of value included streaming 55 monitoring of sensor readings (particularly for temperature), equipment functionality, and applications related to processing relevant voice, video and warranty data. Customer lifetime value (CLV) As you work to understand your marketing ROI and the cost of customer churn, you’ll want to analyse customer lifetime value (CLV), the total future value that a customer will bring to your organization. 
A basic CLV calculation (before discounting) would be:

(Annual profit from customer) × (Expected number of years the customer is active) − (Cost of acquiring the customer)

Estimating CLV for customer segments lets you better understand the ROI from acquisition efforts in each segment. If the expected profits don't exceed the acquisition costs, you won't want to pursue those customers. The accuracy of your CLV calculation increases with your ability to sub-segment customers and your ability to compute the corresponding churn rates. Your ability to mitigate churn and to further activate customers through cross-sell, up-sell and additional conversion rate optimization will boost your CLV. Use available big data to produce the more refined customer segmentation. The additional data will primarily consist of digital activity (including acquisition source, webpage navigation, email open rates, content downloads and activity on social media) but for some industries may also include audio and video data produced by your customer. To illustrate, you may find that customers you acquire from social media referrals will remain with you longer than customers you acquire from price comparison sites.

Lead scoring

Lead scoring is the art/science/random guess whereby you rank your sales prospects in decreasing order of potential value. A 2012 study by Marketing Sherpa reported that only 21 per cent of B2B marketers were already using lead scoring,38 highlighting abundant room for growth. Use lead scoring to help your sales team prioritize their efforts, wasting less time on dead-end leads and spending more time on high-potential prospects. You'll borrow techniques you used in churn analysis and CLV to generate a lead score, which multiplies the likelihood of lead conversion by the estimated CLV of the lead. For attempted cross-sell and up-sell to existing customers, start from the same sources of customer data. If the lead is not a current customer and conversions are infrequent, you'll generally have much less data to work with, so you'll need to select and calibrate models that work with more limited data (e.g. machine learning models won't generally work). Consider using AI methods to detect signals in audio and video records matched with sales events. If there is sufficient training data, these methods could be trained to automatically flag your high-potential sales prospects (in real time). We mention a very basic example of such a method in Chapter 6.

Human resources (HR)

If you work in HR, leverage the tools and methods for lead scoring, churn analysis and conversion rate optimization to find and attract the best candidates, reduce employee churn and improve KPIs related to productivity and employee satisfaction. Recruitment and human resource professionals examine similar data to understand and ultimately influence recruitment success, increase employee productivity and minimize regretted attrition. In addition to traditional HR data (demographics, application date, starting date, positions, salaries, etc.), leverage the new data becoming available to you, such as response patterns for different types of job postings, photos and videos of candidates, free text on CVs / interview notes / emails / manager reviews and any other digital records available, including activity on social media. Pay attention to privacy laws and to the privacy policies of your organization. The analytics on this data can provide valuable insights even without retaining personally identifiable information.
It can be done not only at the level of individual employees but also at progressively aggregate levels: department, region and country.

Sentiment analysis

You can get insights into the intentions, attitudes and emotions of your customers by analysing their text, speech, video and typing rhythms, as well as from data returned by onsite monitors such as video cameras and infra-red monitors. Always-up monitoring systems can give you public reaction to your marketing or news events. If you are concerned with security or fraud, you can use sentiment analysis to flag high-risk individuals at entrance points or during an application process, forwarding these cases to trained staff for manual evaluation. As with any AI, sentiment analysis will not be 100 per cent accurate, but it can prove invaluable in bringing trending opinions to your attention much more quickly than manual efforts, and in quickly combing through extensive and rapidly moving data to identify common themes. In addition, some systems can spot features and patterns more accurately than human observers.

Keep in mind

Big data technologies help you do many things better, but they are not a silver bullet. You should typically build your first solutions using traditional data, and then use big data to build even better solutions.

So far, we've painted the big picture of big data and AI, and we've looked at several business applications. We end Part 1 of this book with a slightly more detailed look at the tools and technologies that make big data solutions possible. We'll then move to Part 2, which focuses on the practical steps you can take to utilize big data within your organization.

Takeaways

We provide a brief overview of 20 applications of business analytics, some of which are incrementally improved and some significantly improved by big data technologies.

Ask yourself

Which of these twenty business applications are most important for your organization? For those already in use within your organization, where could you add additional data sources, particularly big data or omni-channel data? Which KPIs could significantly improve your results if they increased by 5 per cent? Consider that a concerted analytic effort should increase a well-managed KPI by 5 per cent and a poorly managed KPI by 20 per cent or more.

Chapter 5

Understanding the big data ecosystem

What makes data 'big'?

When referring to data as 'big data', we should expect to have one or more of 'the three Vs' first listed in 2001 by Gartner's Doug Laney: volume, velocity and variety. You might also see creative references to additional Vs, such as veracity.39

Volume refers to the sheer quantity of data that you store. If you store the names and addresses of your immediate family, that is data. If you store the names and addresses of everyone in your country, that is a lot of data (you might need to use a different program on your computer). If everyone in your country sends you their autobiography, that is big data. You would need to rethink how you store such data. I described earlier how the NSA recently completed a data centre that may reach 'one yottabyte' of storage40 and how YouTube is perhaps the largest non-government consumer of data storage today. This is thanks to over one billion YouTube users,41 half of whom are watching from mobile devices, and who, all told, are uploading new video content at such a rapid rate that the content uploaded on 15 March alone could include high-definition video of every single second of the life of Julius Caesar.
The world continues to change rapidly, and scientists predict that we will soon be storing newly sequenced genomic data at a rate even greater than that of YouTube uploads.42

Case study – Genomic data

Biologists may soon become the largest public consumers of data storage. With the cost of sequencing a human genome now under $1000, sequencing speeds of over 10,000 gigabase pairs per week, and the creation of over 1000 genomic sequencing centres spread across 50 countries, we are now seeing a doubling of stored genomic data every 7 months. Researchers at the Simons Center for Quantitative Biology at Cold Spring Harbor Laboratory (CSHL) recently published a paper42 predicting that the field of genomics will soon become the world's largest consumer of incremental storage. They predict that as many as 2 billion people will have their full genomes sequenced over the next ten years. In addition, new genome sequencing technologies are revealing previously unimagined levels of genome variation, particularly in cancers, meaning that researchers may eventually sequence and store thousands of genomes per individual.

Velocity refers to how rapidly data accumulates. Processing 100,000 product search requests on your webshop over the course of an hour is very different from processing those requests in a fraction of a second. Earlier I introduced the Square Kilometre Array (SKA), a next-generation radio telescope designed to have 50 times the sensitivity and 10,000 times the survey speed of other imaging instruments.43 Once completed, it will acquire an amazing 750 terabytes of sample image data per second.44 That data flow would fill the storage of an average laptop 500 times in the time it takes to blink and would be enough to fill every laptop in Paris in the span of a Parisian lunch break. When eBay first purchased its gold-standard, massively parallel database from Teradata in 2002, its storage capacity at that time would have been filled by this SKA data in under two seconds. Not every velocity challenge is a volume challenge. The SKA astronomers and the particle physicists at CERN discard most data after filtering it.

Variety refers to the type and nature of the data. Your traditional customer data has set fields such as Name, Address and Phone Number, but data is often free text, visual data, sensor data, or some combination of data and time stamps, which together preserve a complex narrative. The systems you use to store and analyse such data need to be flexible enough to accommodate data whose exact form can't be anticipated. We'll talk about technologies that can handle such data in Chapter 8. The three Vs describe major challenges you'll need to overcome.
