Machine Learning and Security: Protecting Systems with Data and Algorithms, 1st Edition PDF
Clarence Chio and David Freeman, 2018
Summary
This book explores the application of machine learning to security, examining how data and algorithms can protect systems. It details real-world uses, including spam filtering and intrusion detection, and discusses a range of machine learning algorithms.
Full Transcript
Machine Learning & Security
PROTECTING SYSTEMS WITH DATA AND ALGORITHMS
Clarence Chio & David Freeman

Praise for Machine Learning and Security

The future of security and safety online is going to be defined by the ability of defenders to deploy machine learning to find and stop malicious activity at Internet scale and speed. Chio and Freeman have written the definitive book on this topic, capturing the latest in academic thinking as well as hard-learned lessons deploying ML to keep people safe in the field.
—Alex Stamos, Chief Security Officer, Facebook

An excellent practical guide for anyone looking to learn how machine learning techniques are used to secure computer systems, from detecting anomalies to protecting end users.
—Dan Boneh, Professor of Computer Science, Stanford University

If you've ever wondered what machine learning in security looks like, this book gives you an HD silhouette.
—Nwokedi C. Idika, PhD, Software Engineer, Google, Security & Privacy Organization

Machine Learning and Security: Protecting Systems with Data and Algorithms
Clarence Chio and David Freeman
Beijing • Boston • Farnham • Sebastopol • Tokyo

Copyright © 2018 Clarence Chio and David Freeman. All rights reserved. Printed in the United States of America. Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Editor: Courtney Allen
Production Editor: Kristen Brown
Copyeditor: Octal Publishing, Inc.
Proofreader: Rachel Head
Tech Reviewers: Joshua Saxe, Hyrum Anderson, Jess Males, and Alex Pinto
Indexer: WordCo Indexing Services, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

February 2018: First Edition

Revision History for the First Edition: 2018-01-26: First Release. See http://oreilly.com/catalog/errata.csp?isbn=9781491979907 for release details.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Machine Learning and Security, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-97990-7 [LSI]

Table of Contents

Preface

1. Why Machine Learning and Security?
- Cyber Threat Landscape
- The Cyber Attacker's Economy
- A Marketplace for Hacking Skills
- Indirect Monetization
- The Upshot
- What Is Machine Learning?
- What Machine Learning Is Not
- Adversaries Using Machine Learning
- Real-World Uses of Machine Learning in Security
- Spam Fighting: An Iterative Approach
- Limitations of Machine Learning in Security

2. Classifying and Clustering
- Machine Learning: Problems and Approaches
- Machine Learning in Practice: A Worked Example
- Training Algorithms to Learn
- Model Families
- Loss Functions
- Optimization
- Supervised Classification Algorithms
- Logistic Regression
- Decision Trees
- Decision Forests
- Support Vector Machines
- Naive Bayes
- k-Nearest Neighbors
- Neural Networks
- Practical Considerations in Classification
- Selecting a Model Family
- Training Data Construction
- Feature Selection
- Overfitting and Underfitting
- Choosing Thresholds and Comparing Models
- Clustering
- Clustering Algorithms
- Evaluating Clustering Results
- Conclusion

3. Anomaly Detection
- When to Use Anomaly Detection Versus Supervised Learning
- Intrusion Detection with Heuristics
- Data-Driven Methods
- Feature Engineering for Anomaly Detection
- Host Intrusion Detection
- Network Intrusion Detection
- Web Application Intrusion Detection
- In Summary
- Anomaly Detection with Data and Algorithms
- Forecasting (Supervised Machine Learning)
- Statistical Metrics
- Goodness-of-Fit
- Unsupervised Machine Learning Algorithms
- Density-Based Methods
- In Summary
- Challenges of Using Machine Learning in Anomaly Detection
- Response and Mitigation
- Practical System Design Concerns
- Optimizing for Explainability
- Maintainability of Anomaly Detection Systems
- Integrating Human Feedback
- Mitigating Adversarial Effects
- Conclusion

4. Malware Analysis
- Understanding Malware
- Defining Malware Classification
- Malware: Behind the Scenes
- Feature Generation
- Data Collection
- Generating Features
- Feature Selection
- From Features to Classification
- How to Get Malware Samples and Labels
- Conclusion

5. Network Traffic Analysis
- Theory of Network Defense
- Access Control and Authentication
- Intrusion Detection
- Detecting In-Network Attackers
- Data-Centric Security
- Honeypots
- Summary
- Machine Learning and Network Security
- From Captures to Features
- Threats in the Network
- Botnets and You
- Building a Predictive Model to Classify Network Attacks
- Exploring the Data
- Data Preparation
- Classification
- Supervised Learning
- Semi-Supervised Learning
- Unsupervised Learning
- Advanced Ensembling
- Conclusion

6. Protecting the Consumer Web
- Monetizing the Consumer Web
- Types of Abuse and the Data That Can Stop Them
- Authentication and Account Takeover
- Account Creation
- Financial Fraud
- Bot Activity
- Supervised Learning for Abuse Problems
- Labeling Data
- Cold Start Versus Warm Start
- False Positives and False Negatives
- Multiple Responses
- Large Attacks
- Clustering Abuse
- Example: Clustering Spam Domains
- Generating Clusters
- Scoring Clusters
- Further Directions in Clustering
- Conclusion

7. Production Systems
- Defining Machine Learning System Maturity and Scalability
- What's Important for Security Machine Learning Systems?
- Data Quality
- Problem: Bias in Datasets
- Problem: Label Inaccuracy
- Solutions: Data Quality
- Problem: Missing Data
- Solutions: Missing Data
- Model Quality
- Problem: Hyperparameter Optimization
- Solutions: Hyperparameter Optimization
- Feature: Feedback Loops, A/B Testing of Models
- Feature: Repeatable and Explainable Results
- Performance
- Goal: Low Latency, High Scalability
- Performance Optimization
- Horizontal Scaling with Distributed Computing Frameworks
- Using Cloud Services
- Maintainability
- Problem: Checkpointing, Versioning, and Deploying Models
- Goal: Graceful Degradation
- Goal: Easily Tunable and Configurable
- Monitoring and Alerting
- Security and Reliability
- Feature: Robustness in Adversarial Contexts
- Feature: Data Privacy Safeguards and Guarantees
- Feedback and Usability
- Conclusion

8. Adversarial Machine Learning
- Terminology
- The Importance of Adversarial ML
- Security Vulnerabilities in Machine Learning Algorithms
- Attack Transferability
- Attack Technique: Model Poisoning
- Example: Binary Classifier Poisoning Attack
- Attacker Knowledge
- Defense Against Poisoning Attacks
- Attack Technique: Evasion Attack
- Example: Binary Classifier Evasion Attack
- Defense Against Evasion Attacks
- Conclusion

A. Supplemental Material for Chapter 2
B. Integrating Open Source Intelligence
Index

Preface

Machine learning is eating the world. From communication and finance to transportation, manufacturing, and even agriculture,[1] nearly every technology field has been transformed by machine learning and artificial intelligence, or will soon be.

Computer security is also eating the world. As we become dependent on computers for an ever-greater proportion of our work, entertainment, and social lives, the value of breaching these systems increases proportionally, drawing in an increasing pool of attackers hoping to make money or simply wreak mischief. Furthermore, as systems become increasingly complex and interconnected, it becomes harder and harder to ensure that there are no bugs or backdoors that will give attackers a way in. Indeed, as this book went to press we learned that pretty much every microprocessor currently in use is insecure.[2]

With machine learning offering (potential) solutions to everything under the sun, it is only natural that it be applied to computer security, a field which intrinsically provides the robust data sets on which machine learning thrives. Indeed, for all the security threats that appear in the news, we hear just as many claims about how A.I. can "revolutionize" the way we deal with security. Because of the promise that it holds for nullifying some of the most complex advances in attacker competency, machine learning has been touted as the technique that will finally put an end to the cat-and-mouse game between attackers and defenders. Walking the expo floors of major security conferences, the trend is apparent: more and more companies are embracing the use of machine learning to solve security problems.

[1] Monsanto, "How Machine Learning is Changing Modern Agriculture," Modern Agriculture, September 13, 2017, https://modernag.org/innovation/machine-learning-changing-modern-agriculture/.
[2] "Meltdown and Spectre," Graz University of Technology, accessed January 23, 2018, https://spectreattack.com/.
Mirroring the growing interest in the marriage of these two fields, there is a corresponding air of cynicism that dismisses it as hype. So how do we strike a balance? What is the true potential of A.I. applied to security? How can you distinguish the marketing fluff from promising technologies? What should I actually use to solve my security problems? The best way we can think of to answer these questions is to dive deep into the science, understand the core concepts, do lots of testing and experimentation, and let the results speak for themselves. However, doing this requires a working knowledge of both data science and computer security. In the course of our work building security systems, leading anti-abuse teams, and speaking at conferences, we have met a few people who have this knowledge, and many more who understand one side and want to learn about the other. This book is the result.

What's In This Book?

We wrote this book to provide a framework for discussing the inevitable marriage of two ubiquitous concepts: machine learning and security. While there is some literature on the intersection of these subjects (and multiple conference workshops: CCS's AISec, AAAI's AICS, and NIPS's Machine Deception), most of the existing work is academic or theoretical. In particular, we did not find a guide that provides concrete, worked examples with code that can educate security practitioners about data science and help machine learning practitioners think about modern security problems effectively.

In examining a broad range of topics in the security space, we provide examples of how machine learning can be applied to augment or replace rule-based or heuristic solutions to problems like intrusion detection, malware classification, or network analysis. In addition to exploring the core machine learning algorithms and techniques, we focus on the challenges of building maintainable, reliable, and scalable data mining systems in the security space. Through worked examples and guided discussions, we show you how to think about data in an adversarial environment and how to identify the important signals that can get drowned out by noise.

Who Is This Book For?

If you are working in the security field and want to use machine learning to improve your systems, this book is for you. If you have worked with machine learning and now want to use it to solve security problems, this book is also for you.

We assume you have some basic knowledge of statistics; most of the more complex math can be skipped upon your first reading without losing the concepts. We also assume familiarity with a programming language. Our examples are in Python and we provide references to the Python packages required to implement the concepts we discuss, but you can implement the same concepts using open source libraries in Java, Scala, C++, Ruby, and many other languages.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords. Also used for commands and command-line output.

Constant width bold
Shows commands or other text that should be typed literally by the user. Also used for emphasis in command-line output.

Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.

This element signifies a tip, suggestion, or general note.

This element indicates a warning or caution.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/oreilly-mlsec/book-resources.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you're reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O'Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product's documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: "Machine Learning and Security by Clarence Chio and David Freeman (O'Reilly). Copyright 2018 Clarence Chio and David Freeman, 978-1-491-97990-7."

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at [email protected].

O'Reilly Safari

Safari (formerly Safari Books Online) is a membership-based training and reference platform for enterprise, government, educators, and individuals.

Members have access to thousands of books, training videos, Learning Paths, interactive tutorials, and curated playlists from over 250 publishers, including O'Reilly Media, Harvard Business Review, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, and Course Technology, among others.

For more information, please visit http://oreilly.com/safari.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O'Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

O'Reilly Media has a web page for this book, where they list errata, examples, and any additional information. You can access this page at http://bit.ly/machineLearningAndSecurity.

The authors have created a website for the book at https://mlsec.net.

To comment or ask technical questions about this book, send email to [email protected].

For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments

The authors thank Hyrum Anderson, Jason Craig, Nwokedi Idika, Jess Males, Andy Oram, Alex Pinto, and Joshua Saxe for thorough technical reviews and feedback on early drafts of this work. We also thank Virginia Wilson, Kristen Brown, and all the staff at O'Reilly who helped us take this project from concept to reality.

Clarence thanks Christina Zhou for tolerating the countless all-nighters and weekends spent on this book, Yik Lun Lee for proofreading drafts and finding mistakes in my code, Jarrod Overson for making me believe I could do this, and Daisy the Chihuahua for being at my side through the toughest of times. Thanks to Anto Joseph for teaching me security, to all the other hackers, researchers, and training attendees who have influenced this book in one way or another, to my colleagues at Shape Security for making me a better engineer, and to Data Mining for Cyber Security speakers and attendees for being part of the community that drives this research. Most of all, thanks to my family in Singapore for supporting me from across the globe and enabling me to chase my dreams and pursue my passion.

David thanks Deepak Agarwal for convincing me to undertake this effort, Dan Boneh for teaching me how to think about security, and Vicente Silveira and my colleagues at LinkedIn and Facebook for showing me what security is like in the real world. Thanks also to Grace Tang for feedback on the machine learning sections as well as the occasional penguin. And the biggest thanks go to Torrey, Elodie, and Phoebe, who put up with me taking many very late nights and a few odd excursions in order to complete this book, and never wavered in their support.

CHAPTER 1
Why Machine Learning and Security?

In the beginning, there was spam.

As soon as academics and scientists had hooked enough computers together via the internet to create a communications network that provided value, other people realized that this medium of free transmission and broad distribution was a perfect way to advertise sketchy products, steal account credentials, and spread computer viruses. In the intervening 40 years, the field of computer and network security has come to encompass an enormous range of threats and domains: intrusion detection, web application security, malware analysis, social network security, advanced persistent threats, and applied cryptography, just to name a few. But even today spam remains a major focus for those in the email or messaging space, and for the general public spam is probably the aspect of computer security that most directly touches their own lives.

Machine learning was not invented by spam fighters, but it was quickly adopted by statistically inclined technologists who saw its potential in dealing with a constantly evolving source of abuse. Email providers and internet service providers (ISPs) have access to a wealth of email content, metadata, and user behavior. Using email data, content-based models can be built to create a generalizable approach to recognize spam. Metadata and entity reputations can be extracted from emails to predict the likelihood that an email is spam without even looking at its content. By instantiating a user behavior feedback loop, the system can build a collective intelligence and improve over time with the help of its users.
Email filters have thus gradually evolved to deal with the growing diversity of circumvention methods that spammers have thrown at them. Even though 85% of all emails sent today are spam (according to one research group), the best modern spam filters block more than 99.9% of all spam, and it is a rarity for users of major email services to see unfiltered and undetected spam in their inboxes. These results demonstrate an enormous advance over the simplistic spam filtering techniques developed in the early days of the internet, which made use of simple word filtering and email metadata reputation to achieve modest results.

The fundamental lesson that both researchers and practitioners have taken away from this battle is the importance of using data to defeat malicious adversaries and improve the quality of our interactions with technology. Indeed, the story of spam fighting serves as a representative example for the use of data and machine learning in any field of computer security. Today, almost all organizations have a critical reliance on technology, and almost every piece of technology has security vulnerabilities. Driven by the same core motivations as the spammers from the 1980s (unregulated, cost-free access to an audience with disposable income and private information to offer), malicious actors can pose security risks to almost all aspects of modern life. Indeed, the fundamental nature of the battle between attacker and defender is the same in all fields of computer security as it is in spam fighting: a motivated adversary is constantly trying to misuse a computer system, and each side races to fix or exploit the flaws in design or technique before the other uncovers it. The problem statement has not changed one bit.

Computer systems and web services have become increasingly centralized, and many applications have evolved to serve millions or even billions of users. Entities that become arbiters of information are bigger targets for exploitation, but are also in the perfect position to make use of the data and their user bases to achieve better security. Coupled with the advent of powerful data crunching hardware and the development of more powerful data analysis and machine learning algorithms, there has never been a better time for exploiting the potential of machine learning in security.

In this book, we demonstrate applications of machine learning and data analysis techniques to various problem domains in security and abuse. We explore methods for evaluating the suitability of different machine learning techniques in different scenarios, and focus on guiding principles that will help you use data to achieve better security. Our goal is not to leave you with the answer to every security problem you might face, but rather to give you a framework for thinking about data and security as well as a toolkit from which you can pick the right method for the problem at hand.

The remainder of this chapter sets up context for the rest of the book: we discuss what threats modern computer and network systems face, what machine learning is, and how machine learning applies to the aforementioned threats. We conclude with a detailed examination of approaches to spam fighting, which provides a concrete example of applying machine learning to security that can be generalized to nearly any domain.
Cyber Threat Landscape

The landscape of adversaries and miscreants in computer security has evolved over time, but the general categories of threats have remained the same. Security research exists to stymie the goals of attackers, and it is always important to have a good understanding of the different types of attacks that exist in the wild. As you can see from the Cyber Threat Taxonomy tree in Figure 1-1,[1] the relationships between threat entities and categories can be complex in some cases. We begin by defining the principal threats that we will explore in the chapters that follow:

[1] Adapted from the European CSIRT Network project's Security Incidents Taxonomy.

Malware (or virus)
Short for "malicious software," any software designed to cause harm or gain unauthorized access to computer systems.

Worm
Standalone malware that replicates itself in order to spread to other computer systems.

Trojan
Malware disguised as legitimate software to avoid detection.

Spyware
Malware installed on a computer system without permission and/or knowledge by the operator, for the purposes of espionage and information collection. Keyloggers fall into this category.

Adware
Malware that injects unsolicited advertising material (e.g., pop-ups, banners, videos) into a user interface, often when a user is browsing the web.

Ransomware
Malware designed to restrict availability of computer systems until a sum of money (ransom) is paid.

Rootkit
A collection of (often) low-level software designed to enable access to or gain control of a computer system. ("Root" denotes the most powerful level of access to a system.)

Backdoor
An intentional hole placed in the system perimeter to allow for future accesses that can bypass perimeter protections.

Bot
A variant of malware that allows attackers to remotely take over and control computer systems, making them zombies.

Botnet
A large network of bots.

Exploit
A piece of code or software that exploits specific vulnerabilities in other software applications or frameworks.

Scanning
Attacks that send a variety of requests to computer systems, often in a brute-force manner, with the goal of finding weak points and vulnerabilities as well as information gathering.

Sniffing
Silently observing and recording network and in-server traffic and processes without the knowledge of network operators.

Keylogger
A piece of hardware or software that (often covertly) records the keys pressed on a keyboard or similar computer input device.

Spam
Unsolicited bulk messaging, usually for the purposes of advertising. Typically email, but could be SMS or through a messaging provider (e.g., WhatsApp).

Login attack
Multiple, usually automated, attempts at guessing credentials for authentication systems, either in a brute-force manner or with stolen/purchased credentials.

Account takeover (ATO)
Gaining access to an account that is not your own, usually for the purposes of downstream selling, identity theft, monetary theft, and so on. Typically the goal of a login attack, but also can be small scale and highly targeted (e.g., spyware, social engineering).

Phishing (aka masquerading)
Communications with a human who pretends to be a reputable entity or person in order to induce the revelation of personal information or to obtain private assets.

Spear phishing
Phishing that is targeted at a particular user, making use of information about that user gleaned from outside sources.
Social engineering
Information exfiltration (extraction) from a human being using nontechnical methods such as lying, trickery, bribery, blackmail, and so on.

Incendiary speech
Discriminatory, discrediting, or otherwise harmful speech targeted at an individual or group.

Denial of service (DoS) and distributed denial of service (DDoS)
Attacks on the availability of systems through high-volume bombardment and/or malformed requests, often also breaking down system integrity and reliability.

Advanced persistent threats (APTs)
Highly targeted network or host attacks in which a stealthy intruder remains intentionally undetected for long periods of time in order to steal and exfiltrate data.

Zero-day vulnerability
A weakness or bug in computer software or systems that is unknown to the vendor, allowing for potential exploitation (called a zero-day attack) before the vendor has a chance to patch/fix the problem.

Figure 1-1. Cyber Threat Taxonomy tree

The Cyber Attacker's Economy

What drives attackers to do what they do? Internet-based criminality has become increasingly commercialized since the early days of the technology's conception. The transformation of cyber attacks from a reputation economy ("street cred," glory, mischief) to a cash economy (direct monetary gains, advertising, sale of private information) has been a fascinating process, especially from the point of view of the adversary.

The motivation of cyber attackers today is largely monetary. Attacks on financial institutions or conduits (online payment platforms, stored value/gift card accounts, Bitcoin wallets, etc.) can obviously bring attackers direct financial gains. But because of the higher stakes at play, these institutions often have more advanced defense mechanisms in place, making the lives of attackers tougher. Because of the allure of a more direct path to financial yield, the marketplace for vulnerabilities targeting such institutions is also comparatively crowded and noisy. This leads miscreants to target entities with more relaxed security measures in place, abusing systems that are open by design and resorting to more indirect techniques that will eventually still allow them to monetize.

A Marketplace for Hacking Skills

The fact that darknet marketplaces and illegal hacking forums exist is no secret. Before the existence of organized underground communities for illegal exchanges, only the most competent of computer hackers could partake in the launching of cyber attacks and the compromising of accounts and computer systems. However, with the commoditization of hacking and the ubiquitization of computer use, lower-skilled "hackers" can participate in the ecosystem of cyber attacks by purchasing vulnerabilities and user-friendly hacking scripts, software, and tools to engage in their own cyber attacks.

The zero-day vulnerability marketplace has variants that exist both legally and illegally. Trading vulnerabilities and exploits can become a viable source of income for both security researchers and computer hackers.[2] Increasingly, the most elite computer hackers are not the ones unleashing zero-days and launching attack campaigns. The risks are just too high, and the process of monetization is just too long and uncertain.
Creating software that empowers the common script-kiddy to carry out the actual hacking, selling vulnerabilities on marketplaces, and in some cases even providing boutique hacking consulting services promises a more direct and certain path to financial gain. Just as in the California Gold Rush of the late 1840s, merchants providing amenities to a growing population of wealth-seekers are more frequently the receivers of windfalls than the seekers themselves.

[2] Charlie Miller, "The Legitimate Vulnerability Market: Inside the Secretive World of 0-day Exploit Sales," Proceedings of the 6th Workshop on the Economics of Information Security (2007).

Indirect Monetization

The process of monetization for miscreants involved in different types of computer attacks is highly varied, and worthy of detailed study. We will not dive too deep into this investigation, but we will look at a couple of examples of how indirect monetization can work.

Malware distribution has been commoditized in a way similar to the evolution of cloud computing and Infrastructure-as-a-Service (IaaS) providers. The pay-per-install (PPI) marketplace for malware propagation is a complex and mature ecosystem, providing wide distribution channels available to malware authors and purchasers.[3] Botnet rentals operate on the same principle as on-demand cloud infrastructure, with per-hour resource offerings at competitive prices. Deploying malware on remote servers can also be financially rewarding in its own different ways. Targeted attacks on entities are sometimes associated with a bounty, and ransomware distributions can be an efficient way to extort money from a wide audience of victims.

[3] Juan Caballero et al., "Measuring Pay-per-Install: The Commoditization of Malware Distribution," Proceedings of the 20th USENIX Conference on Security (2011).

Spyware can assist in the stealing of private information, which can then be sold in bulk on the same online marketplaces where the spyware is sold. Adware and spam can be used as a cheap way to advertise dodgy pharmaceuticals and financial instruments. Online accounts are frequently taken over for the purposes of retrieving some form of stored value, such as gift cards, loyalty points, store credit, or cash rewards. Stolen credit card numbers, Social Security numbers, email accounts, phone numbers, addresses, and other private information can be sold online to criminals intent on identity theft, fake account creation, fraud, and so on. But the path to monetization, in particular when you have a victim's credit card number, can be a long and complex one. Because of how easily this information is stolen, credit card companies, as well as companies that operate accounts with stored value, often engineer clever ways to stop attackers from monetizing. For instance, accounts suspected of having been compromised can be invalidated, or cashing out gift cards can require additional authentication steps.

The Upshot

The motivations of cyber attackers are complex and the paths to monetization are convoluted. However, the financial gains from internet attacks can be a powerful motivator for technically skilled people, especially those in less-wealthy nations and communities.
Since the dawn of the technological age, researchers have dreamed of teaching com‐ puters to reason and make “intelligent” decisions in the way that humans do, by drawing generalizations and distilling concepts from complex information sets without explicit instructions. Machine learning refers to one aspect of this goal—specifically, to algorithms and processes that “learn” in the sense of being able to generalize past data and experien‐ ces in order to predict future outcomes. At its core, machine learning is a set of math‐ ematical techniques, implemented on computer systems, that enables a process of information mining, pattern discovery, and drawing inferences from data. At the most general level, supervised machine learning methods adopt a Bayesian approach to knowledge discovery, using probabilities of previously observed events to infer the probabilities of new events. Unsupervised methods draw abstractions from unlabeled datasets and apply these to new data. Both families of methods can be applied to problems of classiication (assigning observations to categories) or regres‐ sion (predicting numerical properties of an observation). Suppose that we want to classify a group of animals into mammals and reptiles. With a supervised method, we will have a set of animals for which we are definitively told their category (e.g., we are told that the dog and elephant are mammals and the alli‐ gator and iguana are reptiles). We then try to extract some features from each of these labeled data points and find similarities in their properties, allowing us to differenti‐ ate animals of different classes. For instance, we see that the dog and the elephant both give birth to live offspring, unlike the alligator and the iguana. The binary prop‐ erty “gives birth to live offspring” is what we call a feature, a useful abstraction for observed properties that allows us to perform comparisons between different obser‐ vations. After extracting a set of features that might help differentiate mammals and reptiles in the labeled data, we then can run a learning algorithm on the labeled data and apply what the algorithm learned to new, unseen animals. When the algorithm is presented with a meerkat, it now must classify it as either a mammal or a reptile. Extracting the set of features from this new animal, the algorithm knows that the meerkat does not lay eggs, has no scales, and is warm-blooded. Driven by prior obser‐ vations, it makes a category prediction that the meerkat is a mammal, and it is exactly right. In the unsupervised case, the premise is similar, but the algorithm is not presented with the initial set of labeled animals. Instead, the algorithm must group the different sets of data points in a way that will result in a binary classification. Seeing that most What Is Machine Learning? | 9 animals that don’t have scales do give birth to live offspring and are also warm- blooded, and most animals that have scales lay eggs and are cold-blooded, the algo‐ rithm can then derive the two categories from the provided set and make future predictions in the same way as in the supervised case. Machine learning algorithms are driven by mathematics and statistics, and the algo‐ rithms that discover patterns, correlations, and anomalies in the data vary widely in complexity. In the coming chapters, we go deeper into the mechanics of some of the most common machine learning algorithms used in this book. 
This book will not give you a complete understanding of machine learning, nor will it cover much of the mathematics and theory in the subject. What it will give you is critical intuition in machine learning and practical skills for designing and implementing intelligent, adaptive systems in the context of security. What Machine Learning Is Not Artiicial intelligence (AI) is a popular but loosely defined term that indicates algorith‐ mic solutions to complex problems typically solved by humans. As illustrated in Figure 1-2, machine learning is a core building block for AI. For example, self-driving cars must classify observed images as people, cars, trees, and so on; they must predict the position and speed of other cars; they must determine how far to rotate the wheels in order to make a turn. These classification and prediction problems are solved using machine learning, and the self-driving system is a form of AI. There are other parts of the self-driving AI decision engine that are hardcoded into rule engines, and that would not be considered machine learning. Machine learning helps us create AI, but is not the only way to achieve it. Figure 1-2. Artiicial intelligence as it relates to machine learning and deep learning 10 | Chapter 1: Why Machine Learning and Security? Deep learning is another popular term that is commonly conflated with machine learning. Deep learning is a strict subset of machine learning referring to a specific class of multilayered models that use layers of simpler statistical components to learn representations of data. “Neural network” is a more general term for this type of lay‐ ered statistical learning architecture that might or might not be “deep” (i.e., have many layers). For an excellent discussion of this topic, see Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville (MIT Press). Statistical analysis is a core part of machine learning: outputs of machine learning algorithms are often presented in terms of probabilities and confidence intervals. We will touch on some statistical techniques in our discussion of anomaly detection, but we will leave aside questions regarding experimentation and statistical hypothesis testing. For an excellent discussion of this topic, see Probability & Statistics for Engi‐ neers & Scientists by Ronald Walpole et al. (Prentice Hall). What Is AI? The definition of AI is a slightly more contentious topic than the definition of machine learning. Machine learning refers to statistical learning algorithms that are able to create generalizable abstractions (models) by seeing and dissecting a dataset. AI systems have been loosely defined to be machine-driven decision engines that can achieve near-human-level intelligence. How near does this intelligence have to be to human intelligence before we consider it to be AI? As you might imagine, differing expectations and definitions of the term make it quite difficult to draw universally agreeable boundaries around this. Adversaries Using Machine Learning Note that nothing prevents adversaries from taking advantage of machine learning to avoid detection and evade defenses. As much as the defenders can learn from the attacks and adjust their countermeasures accordingly, attackers can also learn the nature of defenses to their own benefit. 
Spammers have been known to apply poly‐ morphism (i.e., changing the appearance of content without changing its meaning) to their payloads to circumvent detection, or to probe spam filters by performing A/B tests on email content and learning what causes their click-through rates to rise and fall. Both good guys and bad guys use machine learning in fuzzing campaigns to speed up the process of finding vulnerabilities in software. Adversaries can even use machine learning to learn about your personality and interests through social media in order to craft the perfect phishing message for you. Finally, the use of dynamic and adaptive methods in the area of security always con‐ tains a certain degree of risk. Especially when explainability of machine learning pre‐ dictions is often lacking, attackers have been known to cause various algorithms to What Is Machine Learning? | 11 make erroneous predictions or learn the wrong thing.4 In this growing field of study called adversarial machine learning, attackers with varying degrees of access to a machine learning system can execute a range of attacks to achieve their ends. Chap‐ ter 8 is dedicated to this topic, and paints a more complete picture of the problems and solutions in this space. Machine learning algorithms are often not designed with security in mind, and are often vulnerable in the face of attempts made by a motivated adversary. Hence, it is important to maintain an awareness of such threat models when designing and build‐ ing machine learning systems for security purposes. Real-World Uses of Machine Learning in Security In this book, we explore a range of different computer security applications for which machine learning has shown promising results. Applying machine learning and data science to solve problems is not a straightforward task. Although convenient pro‐ gramming libraries remove some complexity from the equation, developers still need to make many decisions along the way. By going through different examples in each chapter, we will explore the most com‐ mon issues faced by practitioners when designing machine learning systems, whether in security or otherwise. The applications described in this book are not new, and you also can find the data science techniques we discuss at the core of many computer systems that you might interact with on a daily basis. We can classify machine learning’s use cases in security into two broad categories: pattern recognition and anomaly detection. The line differentiating pattern recognition and anomaly detection is sometimes blurry, but each task has a clearly distinguished goal. In pattern recognition, we try to discover explicit or latent characteristics hid‐ den in the data. These characteristics, when distilled into feature sets, can be used to teach an algorithm to recognize other forms of the data that exhibit the same set of characteristics. Anomaly detection approaches knowledge discovery from the other side of the same coin. Instead of learning specific patterns that exist within certain subsets of the data, the goal is to establish a notion of normality that describes most (say, more than 95%) of a given dataset. Thereafter, deviations from this normality of any sort will be detected as anomalies. It is common to erroneously think of anomaly detection as the process of recognizing a set of normal patterns and differentiating it from a set of abnormal patterns. 
Patterns extracted through pattern recognition must be strictly derived from the observed data used to train the algorithm. On the other hand, in anomaly detection there can be an infinite number of anomalous patterns that fit the bill of an outlier, even those derived from hypothetical data that do not exist in the training or testing datasets.

Spam detection is perhaps the classic example of pattern recognition because spam typically has a largely predictable set of characteristics, and an algorithm can be trained to recognize those characteristics as a pattern by which to classify emails. Yet it is also possible to think of spam detection as an anomaly detection problem. If it is possible to derive a set of features that describes normal traffic well enough to treat significant deviations from this normality as spam, we have succeeded. In actuality, however, spam detection might not be suitable for the anomaly detection paradigm, because it is not difficult to convince yourself that it is in most contexts easier to find similarities between spam messages than within the broad set of normal traffic.

Malware detection and botnet detection are other applications that fall clearly in the category of pattern recognition, where machine learning becomes especially useful when the attackers employ polymorphism to avoid detection. Fuzzing is the process of throwing arbitrary inputs at a piece of software to force the application into an unintended state, most commonly to force a program to crash or be put into a vulnerable mode for further exploitation. Naive fuzzing campaigns often run into the problem of having to iterate over an intractably large application state space. The most widely used fuzzing software has optimizations that make fuzzing much more efficient than blind iteration. Machine learning has also been used in such optimizations, by learning patterns of previously found vulnerabilities in similar programs and guiding the fuzzer to similarly vulnerable code paths or idioms for potentially quicker results.

For user authentication and behavior analysis, the delineation between pattern recognition and anomaly detection becomes less clear. For cases in which the threat model is clearly known, it might be more suitable to approach the problem through the lens of pattern recognition. In other cases, anomaly detection can be the answer. In many cases, a system might make use of both approaches to achieve better coverage. Network outlier detection is a classic example of anomaly detection because most network traffic follows strict protocols and normal behavior matches a set of patterns in form or sequence. Any malicious network activity that does not manage to masquerade well by mimicking normal traffic will be caught by outlier detection algorithms. Other network-related detection problems, such as malicious URL detection, can also be approached from the angle of anomaly detection.

Access control refers to any set of policies governing the ability of system users to access certain pieces of information. Frequently used to protect sensitive information from unnecessary exposure, access control policies are often the first line of defense against breaches and information theft. Machine learning has gradually found its way into access control solutions because of the pains experienced by system users at the mercy of rigid and unforgiving access control policies.[5]
Machine learning has gradually found its way into access control solutions because of the pains experienced by system users at the Real-World Uses of Machine Learning in Security | 13 mercy of rigid and unforgiving access control policies.5 Through a combination of unsupervised learning and anomaly detection, such systems can infer information access patterns for certain users or roles in an organization and engage in retaliatory action when an unconventional pattern is detected. Imagine, for example, a hospital’s patient record storage system, where nurses and medical technicians frequently need to access individual patient data but don’t neces‐ sarily need to do cross-patient correlations. Doctors, on the other hand, frequently query and aggregate the medical records of multiple patients to look for case similari‐ ties and diagnostic histories. We don’t necessarily want to prevent nurses and medical technicians from querying multiple patient records because there might be rare cases that warrant such actions. A strict rule-based access control system would not be able to provide the flexibility and adaptability that machine learning systems can provide. In the rest of this book, we dive deeper into a selection of these real-world applica‐ tions. We then will be able to discuss the nuances around applying machine learning for pattern recognition and anomaly detection in security. In the remainder of this chapter, we focus on the example of spam fighting as one that illustrates the core principles used in any application of machine learning to security. Spam Fighting: An Iterative Approach As discussed earlier, the example of spam fighting is both one of the oldest problems in computer security and one that has been successfully attacked with machine learn‐ ing. In this section, we dive deep into this topic and show how to gradually build up a sophisticated spam classification system using machine learning. The approach we take here will generalize to many other types of security problems, including but not limited to those discussed in later chapters of this book. Consider a scenario in which you are asked to solve the problem of rampant email spam affecting employees in an organization. For whatever reason, you are instructed to develop a custom solution instead of using commercial options. Provided with administrator access to the private email servers, you are able to extract a body of emails for analysis. All the emails are properly tagged by recipients as either “spam” or “ham” (non-spam), so you don’t need to spend too much time cleaning the data.6 Human beings do a good job at recognizing spam, so you begin by implementing a simple solution that approximates a person’s thought process while executing this task. Your theory is that the presence or absence of some prominent keywords in an 5 Evan Martin and Tao Xie, “Inferring Access-Control Policy Properties via Machine Learning,” Proceedings of the 7th IEEE International Workshop on Policies for Distributed Systems and Networks (2006): 235–238. 6 In real life, you will spend a large proportion of your time cleaning the data in order to make it available to and useful for your algorithms. 14 | Chapter 1: Why Machine Learning and Security? email is a strong binary indicator of whether the email is spam or ham. For instance, you notice that the word “lottery” appears in the spam data a lot, but seldom appears in regular emails. 
Perhaps you could come up with a list of similar words and per‐ form the classification by checking whether a piece of email contains any words that belong to this blacklist. The dataset that we will use to explore this problem is the 2007 TREC Public Spam Corpus. This is a lightly cleaned raw email message corpus containing 75,419 mes‐ sages collected from an email server over a three-month period in 2007. One-third of the dataset is made up of spam examples, and the rest is ham. This dataset was cre‐ ated by the Text REtrieval Conference (TREC) Spam Track in 2007, as part of an effort to push the boundaries of state-of-the-art spam detection. For evaluating how well different approaches work, we will go through a simple vali‐ dation process.7 We split the dataset into nonoverlapping training and test sets, in which the training set consists of 70% of the data (an arbitrarily chosen proportion) and the test set consists of the remaining 30%. This method is standard practice for assessing how well an algorithm or model developed on the basis of the training set will generalize to an independent dataset. The first step is to use the Natural Language Toolkit (NLTK) to remove morphologi‐ cal affixes from words for more flexible matching (a process called stemming). For instance, this would reduce the words “congratulations” and “congrats” to the same stem word, “congrat.” We also remove stopwords (e.g., “the,” “is,” and “are,”) before the token extraction process, because they typically do not contain much meaning. We define a set of functions8 to help with loading and preprocessing the data and labels, as demonstrated in the following code:9 import string import email import nltk punctuations = list(string.punctuation) stopwords = set(nltk.corpus.stopwords.words('english')) 7 This validation process, sometimes referred to as conventional validation, is not as rigorous a validation method as cross-validation, which refers to a class of methods that repeatedly generate all (or many) different possible splits of the dataset (into training and testing sets), performing validation of the machine learning prediction algorithm separately on each of these. The result of cross-validation is the average prediction accu‐ racy across these different splits. Cross-validation estimates model accuracy better than conventional valida‐ tion because it avoids the pitfall of information loss from a single train/test split that might not adequately capture the statistical properties of the data (this is typically not a concern if the training set is sufficiently large). Here we chose to use conventional validation for simplicity. 8 These helper functions are defined in the file chapter1/email_read_util.py in our code repository. 9 To run this code, you need to install the Punkt Tokenizer Models and the stopwords corpus in NLTK using the nltk.download() utility. 
```python
import string
import email
import nltk

punctuations = list(string.punctuation)
stopwords = set(nltk.corpus.stopwords.words('english'))
stemmer = nltk.PorterStemmer()

# Combine the different parts of the email into a flat list of strings
def flatten_to_string(parts):
    ret = []
    if type(parts) == str:
        ret.append(parts)
    elif type(parts) == list:
        for part in parts:
            ret += flatten_to_string(part)
    elif parts.get_content_type() == 'text/plain':
        ret += parts.get_payload()
    return ret

# Extract subject and body text from a single email file
def extract_email_text(path):
    # Load a single email from an input file
    with open(path, errors='ignore') as f:
        msg = email.message_from_file(f)
    if not msg:
        return ""

    # Read the email subject
    subject = msg['Subject']
    if not subject:
        subject = ""

    # Read the email body
    body = ' '.join(m for m in flatten_to_string(msg.get_payload())
                    if type(m) == str)
    if not body:
        body = ""

    return subject + ' ' + body

# Process a single email file into stemmed tokens
def load(path):
    email_text = extract_email_text(path)
    if not email_text:
        return []

    # Tokenize the message
    tokens = nltk.word_tokenize(email_text)

    # Remove punctuation from tokens
    tokens = [i.strip("".join(punctuations)) for i in tokens
              if i not in punctuations]

    # Remove stopwords and stem tokens
    if len(tokens) > 2:
        return [stemmer.stem(w) for w in tokens if w not in stopwords]
    return []
```

[7] This validation process, sometimes referred to as conventional validation, is not as rigorous a validation method as cross-validation, which refers to a class of methods that repeatedly generate all (or many) different possible splits of the dataset (into training and testing sets), performing validation of the machine learning prediction algorithm separately on each of these. The result of cross-validation is the average prediction accuracy across these different splits. Cross-validation estimates model accuracy better than conventional validation because it avoids the pitfall of information loss from a single train/test split that might not adequately capture the statistical properties of the data (this is typically not a concern if the training set is sufficiently large). Here we chose to use conventional validation for simplicity.
[8] These helper functions are defined in the file chapter1/email_read_util.py in our code repository.
[9] To run this code, you need to install the Punkt Tokenizer Models and the stopwords corpus in NLTK using the nltk.download() utility.

Next, we proceed with loading the emails and labels. This dataset provides each email in its own individual file (inmail.1, inmail.2, inmail.3, …), along with a single label file (full/index) in the following format:

```
spam ../data/inmail.1
ham ../data/inmail.2
spam ../data/inmail.3
...
```

Each line in the label file contains the "spam" or "ham" label for each email sample in the dataset. Let's read the dataset and build a blacklist of spam words now:[10]

```python
import os

DATA_DIR = 'datasets/trec07p/data/'
LABELS_FILE = 'datasets/trec07p/full/index'
TRAINING_SET_RATIO = 0.7

labels = {}
spam_words = set()
ham_words = set()

# Read the labels
with open(LABELS_FILE) as f:
    for line in f:
        line = line.strip()
        label, key = line.split()
        labels[key.split('/')[-1]] = 1 if label.lower() == 'ham' else 0

# Split corpus into training and test sets
filelist = os.listdir(DATA_DIR)
X_train = filelist[:int(len(filelist)*TRAINING_SET_RATIO)]
X_test = filelist[int(len(filelist)*TRAINING_SET_RATIO):]

for filename in X_train:
    path = os.path.join(DATA_DIR, filename)
    if filename in labels:
        label = labels[filename]
        stems = load(path)
        if not stems:
            continue
        if label == 1:
            ham_words.update(stems)
        elif label == 0:
            spam_words.update(stems)

blacklist = spam_words - ham_words
```

[10] This example can be found in the Python Jupyter notebook chapter1/spam-fighting-blacklist.ipynb in our code repository.

Upon inspection of the tokens in blacklist, you might feel that many of the words are nonsensical (e.g., Unicode, URLs, filenames, symbols, foreign words). You can remedy this problem with a more thorough data-cleaning process, but these simple results should perform adequately for the purposes of this experiment:

greenback, gonorrhea, lecher, ...

Evaluating our methodology on the 22,626 emails in the testing set, we realize that this simplistic algorithm does not do as well as we had hoped.
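The scoring loop that produces the numbers reported below is not included in this excerpt. A minimal sketch consistent with the variables defined above might look like the following; the prediction rule (flag a message as spam if any of its stemmed tokens appears in blacklist) and the counter names are our assumptions, not the book's code:

```python
# Score the blacklist against the held-out test set.
# Label convention from above: 1 = ham, 0 = spam.
tp = tn = fp = fn = 0
for filename in X_test:
    if filename not in labels:
        continue
    stems = load(os.path.join(DATA_DIR, filename))
    if not stems:
        continue  # preprocessing error; email stays unclassified
    predicted_spam = not blacklist.isdisjoint(stems)
    actual_spam = labels[filename] == 0
    if predicted_spam and actual_spam:
        tp += 1
    elif predicted_spam and not actual_spam:
        fp += 1
    elif not predicted_spam and actual_spam:
        fn += 1
    else:
        tn += 1

accuracy = float(tp + tn) / (tp + tn + fp + fn)
print('Classification accuracy: {:.1%}'.format(accuracy))
```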
We report the results in a confusion matrix, a 2 × 2 matrix that gives the number of examples with given predicted and actual labels for each of the four possible pairs:

                   Predicted HAM    Predicted SPAM
    Actual HAM     6,772            714
    Actual SPAM    5,835            7,543

    True positive: predicted spam + actual spam
    True negative: predicted ham + actual ham
    False positive: predicted spam + actual ham
    False negative: predicted ham + actual spam

Converting this to percentages, we get the following:

                   Predicted HAM    Predicted SPAM
    Actual HAM     32.5%            3.4%
    Actual SPAM    28.0%            36.2%

    Classification accuracy: 68.7%

Ignoring the fact that 5.8% of emails were not classified because of preprocessing errors, we see that the performance of this naive algorithm is actually quite fair. Our spam blacklist technique has a 68.7% classification accuracy (i.e., total proportion of correct labels). However, the blacklist doesn’t include many words that spam emails use, because they are also frequently found in legitimate emails. It also seems like an impossible task to maintain a constantly updated set of words that can cleanly divide spam and ham. Maybe it’s time to go back to the drawing board.

Next, you remember reading that one of the popular ways that email providers fought spam in the early days was to perform fuzzy hashing on spam messages and filter emails that produced a similar hash. This is a type of collaborative filtering that relies on the wisdom of other users on the platform to build up a collective intelligence that will hopefully generalize well and identify new incoming spam. The hypothesis is that spammers use some automation in crafting spam, and hence produce spam messages that are only slight variations of one another. A fuzzy hashing algorithm, or more specifically a locality-sensitive hash (LSH), can allow you to find approximate matches of emails that have been marked as spam.

Upon doing some research, you come across datasketch, a comprehensive Python package that has efficient implementations of the MinHash + LSH algorithm11 to perform string matching with sublinear query costs (with respect to the cardinality of the spam set). MinHash converts string token sets to short signatures while preserving qualities of the original input that enable similarity matching. LSH can then be applied on MinHash signatures instead of raw tokens, greatly improving performance. MinHash trades some accuracy for these performance gains, so there will be some false positives and false negatives in your result.

11 See Chapter 3 in Mining of Massive Datasets, 2nd ed., by Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman (Cambridge University Press).

However, performing naive fuzzy string matching on every email message against the full set of n spam messages in your training set incurs either O(n) query complexity (if you scan your corpus each time) or O(n) memory (if you build a hash table of your corpus), and you decide that you can deal with this trade-off:12,13

    from datasketch import MinHash, MinHashLSH

    # Extract only spam files for inserting into the LSH matcher
    spam_files = [x for x in X_train if labels[x] == 0]

    # Initialize MinHashLSH matcher with a Jaccard
    # threshold of 0.5 and 128 MinHash permutation functions
    lsh = MinHashLSH(threshold=0.5, num_perm=128)

    # Populate the LSH matcher with training spam MinHashes
    for idx, f in enumerate(spam_files):
        minhash = MinHash(num_perm=128)
        stems = load(os.path.join(DATA_DIR, f))
        if len(stems) < 2:
            continue
        for s in stems:
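            # update() folds each stemmed token into the MinHash signature;
            # emails with similar stem sets produce signatures with high
            # estimated Jaccard similarity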
            minhash.update(s.encode('utf-8'))
        lsh.insert(f, minhash)

12 This example can be found in the Python Jupyter notebook chapter1/spam-fighting-lsh.ipynb in our code repository.

13 Note that we specified the MinHashLSH object’s threshold parameter as 0.5. This particular LSH implementation uses Jaccard similarities between the MinHashes in your collection and the query MinHash, returning the list of objects that satisfy the threshold condition (i.e., Jaccard similarity score > 0.5). The MinHash algorithm generates short and unique signatures for a string by passing random permutations of the string through a hash function. Configuring the num_perm parameter to 128 means that 128 random permutations of the string were computed and passed through the hash function. In general, the more random permutations used in the algorithm, the higher the accuracy of the hash.

Now it’s time to have the LSH matcher predict labels for the test set:

    def lsh_predict_label(stems):
        '''
        Queries the LSH matcher and returns:
            0 if predicted spam
            1 if predicted ham
           -1 if parsing error
        '''
        minhash = MinHash(num_perm=128)
        if len(stems) < 2:
            return -1
        for s in stems:
            minhash.update(s.encode('utf-8'))
        matches = lsh.query(minhash)
        if matches:
            return 0
        else:
            return 1

Inspecting the results, you see the following:

                   Predicted HAM    Predicted SPAM
    Actual HAM     7,350            136
    Actual SPAM    2,241            11,038

Converting this to percentages, you get:

                   Predicted HAM    Predicted SPAM
    Actual HAM     35.4%            0.7%
    Actual SPAM    10.8%            53.2%

    Classification accuracy: 88.6%

That’s approximately 20 percentage points better than the previous naive blacklisting approach, and significantly better with respect to false positives (i.e., predicted spam + actual ham). However, these results are still not quite in the same league as modern spam filters. Digging into the data, you realize that it might not be an issue with the algorithm, but with the nature of the data you have—the spam in your dataset just doesn’t seem all that repetitive. Email providers are in a much better position to make use of collaborative spam filtering because of the volume and diversity of messages that they see. Unless a spammer were to target a large number of employees in your organization, there would not be a significant amount of repetition in the spam corpus. You need to go beyond matching stem words and computing Jaccard similarities if you want a breakthrough.

By this point, you are frustrated with experimentation and decide to do more research before proceeding. You see that many others have obtained promising results using a technique called Naive Bayes classification. After getting a decent understanding of how the algorithm works, you begin to create a prototype solution.

Scikit-learn provides a surprisingly simple class, sklearn.naive_bayes.MultinomialNB, that you can use to generate quick results for this experiment. You can reuse a lot of the earlier code for parsing the email files and preprocessing the labels. However, you decide to try passing in the entire email subject and plain text body (separated by a space) without doing any stopword removal or stemming with NLTK. You define a small function to read all the email files into this text form:14,15

    def read_email_files():
        X = []
        y = []
        for i in range(len(labels)):
            filename = 'inmail.' + str(i+1)
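            # Corpus files are named inmail.1 through inmail.N, and the labels
            # dictionary built earlier is keyed by these same basenames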
            email_str = extract_email_text(os.path.join(DATA_DIR, filename))
            X.append(email_str)
            y.append(labels[filename])
        return X, y

14 This example can be found in the Python Jupyter notebook chapter1/spam-fighting-naivebayes.ipynb in our code repository.

15 It is a loose convention in machine learning code to choose lowercase variable names for single columns of values and uppercase variable names for multiple columns of values.

Then you use the utility function sklearn.model_selection.train_test_split() to randomly split the dataset into training and testing subsets (a fixed random_state argument is passed in for the sake of result reproducibility):

    from sklearn.model_selection import train_test_split

    X, y = read_email_files()

    X_train, X_test, y_train, y_test, idx_train, idx_test = \
        train_test_split(X, y, range(len(y)),
                         train_size=TRAINING_SET_RATIO, random_state=2)

Now that you have prepared the raw data, you need to do some further processing of the tokens to convert each email to a vector representation that MultinomialNB accepts as input.

One of the simplest ways to convert a body of text into a feature vector is to use the bag-of-words representation, which goes through the entire corpus of documents and generates a vocabulary of tokens used throughout the corpus. Every word in the vocabulary comprises a feature, and each feature value is the count of how many times the word appears in the document. For example, consider a hypothetical scenario in which you have only three messages in the entire corpus:

    tokenized_messages: {
        'A': ['hello', 'mr', 'bear'],
        'B': ['hello', 'hello', 'gunter'],
        'C': ['goodbye', 'mr', 'gunter']
    }

    # Bag-of-words feature vector column labels:
    # ['hello', 'mr', 'bear', 'gunter', 'goodbye']

    vectorized_messages: {
        'A': [1,1,1,0,0],
        'B': [2,0,0,1,0],
        'C': [0,1,0,1,1]
    }

Even though this process discards seemingly important information like the order of words, content structure, and word similarities, it is very simple to implement using the sklearn.feature_extraction.text.CountVectorizer class:

    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer()
    X_train_vector = vectorizer.fit_transform(X_train)
    X_test_vector = vectorizer.transform(X_test)

You can also try using the term frequency/inverse document frequency (TF/IDF) vectorizer instead of raw counts. TF/IDF normalizes raw word counts and is in general a better indicator of a word’s statistical importance in the text. It is provided as sklearn.feature_extraction.text.TfidfVectorizer.

Now you can train and test your multinomial Naive Bayes classifier:

    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import accuracy_score

    # Initialize the classifier and make label predictions
    mnb = MultinomialNB()
    mnb.fit(X_train_vector, y_train)
    y_pred = mnb.predict(X_test_vector)

    # Print results
    print('Accuracy: {:.3f}'.format(accuracy_score(y_test, y_pred)))

    > Accuracy: 0.956

An accuracy of 95.6%—a whole 7 percentage points better than the LSH approach!16 That’s not a bad result for a few lines of code, and it’s in the ballpark of what modern spam filters can do. Some state-of-the-art spam filters are in fact driven by some variant of Naive Bayes classification.

16 In general, using only accuracy to measure model prediction performance is crude and not comprehensive. Model evaluation is an important topic that we discuss further in Chapter 2. Here we opt for simplicity and use accuracy as an approximate measure of performance. The sklearn.metrics.classification_report() method provides the precision, recall, f1-score, and support for each class, which can be used in combination to get a more accurate picture of how the model performs.

In machine learning, combining multiple independent classifiers and algorithms into an ensemble (also known as stacked generalization, or stacking) is a common way of taking advantage of each method’s strengths.
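As a minimal sketch of this idea (our own illustration, not code from the book’s repository), you could combine the Naive Bayes model with a second classifier trained on the same bag-of-words features, using scikit-learn’s VotingClassifier to average their predicted class probabilities:

    from sklearn.ensemble import VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB

    # Hypothetical example: a soft-voting ensemble over the vectorized
    # features from above; a production system might instead stack
    # blacklist, fuzzy-hash, and Naive Bayes scores as features
    ensemble = VotingClassifier(
        estimators=[('nb', MultinomialNB()),
                    ('lr', LogisticRegression(max_iter=1000))],
        voting='soft')
    ensemble.fit(X_train_vector, y_train)
    print('Ensemble accuracy: {:.3f}'.format(
        ensemble.score(X_test_vector, y_test)))

Soft voting requires each base model to implement predict_proba(); for heterogeneous detectors such as a fuzzy-hash matcher, feeding each detector’s output into a final meta-classifier (true stacking) is the more natural fit.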
So, you can imagine how a combination of word blacklists, fuzzy hash matching, and a Naive Bayes model can help to improve this result.

Alas, spam detection in the real world is not as simple as we have made it out to be in this example. There are many different types of spam, each with a different attack vector and method of avoiding detection. For instance, some spam messages rely heavily on tempting the reader to click links. The email’s content body thus might not contain as much incriminating text as other kinds of spam. Spam of this kind might then try to circumvent link-spam detection classifiers using complex methods like cloaking and redirection chains. Other kinds of spam might rely on images alone and not on text at all.

For now, you are happy with your progress and decide to deploy this solution. As is always the case when dealing with human adversaries, the spammers will eventually realize that their emails are no longer getting through and might act to avoid detection. This response is nothing out of the ordinary for problems in security. You must constantly improve your detection algorithms and classifiers and stay one step ahead of your adversaries. In the following chapters, we explore how machine learning methods can help you avoid having to be constantly engaged in this whack-a-mole game with attackers, and how you can create a more adaptive solution to minimize constant manual tweaking.

Limitations of Machine Learning in Security

The notion that machine learning methods will always give good results across different use cases is categorically false. In real-world scenarios there are usually factors to optimize for other than precision, recall, or accuracy.

As an example, explainability of classification results can be more important in some applications than in others. It can be considerably more difficult to extract the reasons for a decision made by a machine learning system than for one made by a simple rule. Some machine learning systems might also be significantly more resource intensive than other alternatives, which can be a dealbreaker for execution in constrained environments such as embedded systems.

There is no silver bullet machine learning algorithm that works well across all problem spaces. Different algorithms vary vastly in their suitability for different applications and different datasets. Although machine learning methods contribute to the notion of artificial intelligence, their capabilities can still only be compared to human intelligence along certain dimensions.

The human decision-making process is informed by a vast body of context drawn from cultural and experiential knowledge. This process is very difficult for machine learning systems to emulate. Take the initial blacklisted-words approach that we used for spam filtering as an example. When a person evaluates the content of an email to determine if it’s ham or spam, the decision-making process is never as simple as looking for the existence of certain words.
The context in which a blacklisted word is used can make it a perfectly reasonable inclusion in non-spam email. Also, spammers might use synonyms of blacklisted words in future emails to convey the same meaning, but a simplistic blacklist would not adapt appropriately. The system simply doesn’t have the context that a human has—it does not know what relevance a particular word bears to the reader. Continually updating the blacklist with new suspicious words is a laborious process, and it in no way guarantees perfect coverage.

Even though your machine-learned model may work perfectly on a training set, you might find that it performs badly on a testing set. A common reason for this problem is that the model has overfit its classification boundaries to the training data, learning characteristics of the dataset that do not generalize well across other unseen datasets. For instance, your spam filter might learn from a training set that all emails containing the words “inheritance” and “Nigeria” can immediately be given a high suspicion score, but it does not know about the legitimate email chain between employees discussing estate inheritances in Nigerian agricultural insurance schemes.

With all these limitations in mind, we should approach machine learning with equal parts of enthusiasm and caution, remembering that not everything can instantly be made better with AI.

CHAPTER 2
Classifying and Clustering

In this chapter, we discuss the most useful machine learning techniques for security applications. After covering some of the basic principles of machine learning, we offer up a toolbox of machine learning algorithms that you can choose from when approaching any given security problem. We have tried to include enough detail about each technique so that you can know when and how to use it, but we do not attempt to cover all the nuances and complexities of the algorithms.

This chapter has more mathematical detail than the rest of the book; if you want to skip the details and begin trying out the techniques, we recommend you read the sections “Machine Learning in Practice: A Worked Example” on page 27 and “Practical Considerations in Classification” on page 55 and then look at a few of the most popular supervised and unsupervised algorithms: logistic regression, decision trees and forests, and k-means clustering.

Machine Learning: Problems and Approaches

Suppose that you are in charge of computer security for your company. You install firewalls, hold phishing training, ensure secure coding practices, and much more. But at the end of the day, all your CEO cares about is that you don’t have a breach. So, you take it upon yourself to build systems that can detect and block malicious traffic to any attack surface. Ultimately, these systems must decide the following:

- For every file sent through the network, does it contain malware?
- For every login attempt, has someone’s password been compromised?
- For every email received, is it a phishing attempt?
- For every request to your servers, is it a denial-of-service (DoS) attack?
- For every outbound request from your network, is it a bot calling its command-and-control server?

These tasks are all classification tasks—binary decisions about the nature of the observed event. Your job can thus be rephrased as follows: classify all events in your network as malicious or legitimate. When phrased in this manner, the task seems almost hopeless; how are you supposed to classify all traffic?
But not to fear! You have a secret weapon: data. Specifically, you have historical logs of binary files, login attempts, emails received, and inbound and outbound requests. In some cases, you might even know of attacks in the past and be able to associate these attacks with the corresponding events in your logs.

Now, to begin solving your problem, you look for patterns in the past data that seem to indicate malicious attacks. For example, you observe that when a single IP address is making more than 20 requests per second to your servers over a period of 5 minutes, it’s probably a DoS attack. (Maybe your servers went down under such a load in the past.)

After you have found patterns in the data, the next step is to encode these patterns as an algorithm—that is, a function that takes as input data about whatever you’re trying to classify and outputs a binary response: “malicious” or “legitimate.” In our example, this algorithm would be very simple:1 it takes as input the number of requests from an IP address over the 5 minutes prior to the request, and outputs “legitimate” if the number is less than 6,000 and “malicious” otherwise.

1 Simple algorithms like this one are usually called “rules.”

At this point, you have learned from the data and created an algorithm to block bad traffic. Congratulations! But there should be something nagging at you: what’s special about the number 20? Why isn’t the limit 19 or 21? Or 19.77? Ideally you should have some principled way of determining which one of these options, or in fact which real number, is best. And if you use an algorithm to scan historical data and find the best classification rule according to some mathematical definition of “best,” this process is called machine learning.

More generally, machine learning is the process of using historical data to create a prediction algorithm for future data. The task we just considered was one of classification: determine which class a new data point (the request) falls into. Classification can be binary, as we just saw, in which case there are only two classes, or multiclass; for example, if you want to determine whether a piece of malware is ransomware, a keylogger, or a remote access trojan.

Machine learning can also be used to solve regression problems, in which we try to predict the value of a real-number variable. For example, you might want to predict the number of phishing emails an employee receives in a given month, given data about their position, access privileges, tenure in the company, security hygiene score, and so on. Regression problems for which the inputs have a time dimension are sometimes called time series analysis; for example, predicting the value of a stock tomorrow given its past performance, or the number of account sign-ins from the Seattle office given a known history. Anomaly detection is a layer on top of regression: it refers to the problem of determining when an observed value is sufficiently different from a predicted value to indicate that something unusual is going on.

Machine learning is also used to solve clustering problems: given a bunch of data points, which ones are similar to one another? For example, if you are trying to analyze a large dataset of internet traffic to your site, you might want to know which requests group together. Some clusters might be botnets, some might be mobile providers, and some might be legitimate users.
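As a toy sketch of this idea (entirely synthetic data and made-up feature values, our own illustration), the k-means algorithm mentioned earlier can recover such traffic groups without being told what they are:

    import numpy as np
    from sklearn.cluster import KMeans

    # Synthetic example: each row is a traffic source described by two
    # invented features, requests per second and mean bytes per request
    rng = np.random.default_rng(0)
    bots = rng.normal(loc=[50.0, 200.0], scale=[5.0, 20.0], size=(100, 2))
    humans = rng.normal(loc=[0.5, 900.0], scale=[0.2, 100.0], size=(100, 2))
    X = np.vstack([bots, humans])

    # Group the 200 sources into two clusters and inspect the centers
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.cluster_centers_)

In practice you would not know the group memberships in advance; the value of clustering is precisely that it surfaces this structure without labels.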
Machine learning can be supervised, in which case you have labels on historical data and you are trying to predict labels on future data. For example, given a large corpus of emails labeled as spam or ham, you can train a spam classifier that tries to predict whether a new incoming message is spam. Alternatively, machine learning can be unsupervised, in which case you have no labels on the historical data; you might not even know what the labels are that you’re trying to predict, for example if you have an