About Pearson

Pearson is the world's learning company, with presence across 70 countries worldwide. Our unique insights and world-class expertise come from a long history of working closely with renowned teachers, authors and thought leaders, as a result of which we have emerged as the preferred choice for millions of teachers and learners across the world.

We believe learning opens up opportunities and creates fulfilling careers, and hence better lives. We therefore collaborate with the best minds to deliver class-leading products, spread across the Higher Education and K12 spectrum. Superior learning experience and improved outcomes are at the heart of everything we do. This product is the result of one such effort. Your feedback plays a critical role in the evolution of our products and you can contact us at [email protected]. We look forward to it.

Machine Learning

Saikat Dutt, Director, Cognizant Technology Solutions
Subramanian Chandramouli, Associate Director, Cognizant Technology Solutions
Amit Kumar Das, Assistant Professor, Institute of Engineering & Management

This book is dedicated to the people without whom the dream could not have come true – my parents, Tarun Kumar Dutt and Srilekha Dutt; the constant inspiration and support of my wife, Adity; and the unconditional love of my sons, Deepro and Devarko.
–Saikat Dutt

My sincerest thanks and appreciation go to several people…
My parents Subramanian and Lalitha
My wife Ramya
My son Shri Krishna
My daughter Shri Siva Ranjani
And my colleagues and friends
–S. Chandramouli

My humble gratitude goes to –
My parents Mrs Juthika Das and Dr Ranajit Kumar Das
My wife Arpana and two lovely daughters Ashmita and Ankita
My academic collaborator, mentor and brother – Dr Saptarsi Goswami and Mr Mrityunjoy Panday
–Amit Kumar Das

Contents

Preface
Acknowledgements
About the Authors
Model Syllabus for Machine Learning
Lesson Plan
1 Introduction to Machine Learning
1.1 Introduction
1.2 What is Human Learning?
1.3 Types of Human Learning
1.3.1 Learning under expert guidance
1.3.2 Learning guided by knowledge gained from experts
1.3.3 Learning by self
1.4 What is Machine Learning?
1.4.1 How do machines learn?
1.4.2 Well-posed learning problem
1.5 Types of Machine Learning
1.5.1 Supervised learning
1.5.2 Unsupervised learning
1.5.3 Reinforcement learning
1.5.4 Comparison – supervised, unsupervised, and reinforcement learning
1.6 Problems Not To Be Solved Using Machine Learning
1.7 Applications of Machine Learning
1.7.1 Banking and finance
1.7.2 Insurance
1.7.3 Healthcare
1.8 State-of-the-Art Languages/Tools in Machine Learning
1.8.1 Python
1.8.2 R
1.8.3 Matlab
1.8.4 SAS
1.8.5 Other languages/tools
1.9 Issues in Machine Learning
1.10 Summary
2 Preparing to Model
2.1 Introduction
2.2 Machine Learning Activities
2.3 Basic Types of Data in Machine Learning
2.4 Exploring Structure of Data
2.4.1 Exploring numerical data
2.4.2 Plotting and exploring numerical data
2.4.3 Exploring categorical data
2.4.4 Exploring relationship between variables
2.5 Data Quality and Remediation
2.5.1 Data quality
2.5.2 Data remediation
2.6 Data Pre-Processing
2.6.1 Dimensionality reduction
2.6.2 Feature subset selection
2.7 Summary
3 Modelling and Evaluation
3.1 Introduction
3.2 Selecting a Model
3.2.1 Predictive models
3.2.2 Descriptive models
3.3 Training a Model (for Supervised Learning)
3.3.1 Holdout method
3.3.2 K-fold cross-validation method
3.3.3 Bootstrap sampling
3.3.4 Lazy vs. eager learner
3.4 Model Representation and Interpretability
3.4.1 Underfitting
3.4.2 Overfitting
3.4.3 Bias–variance trade-off
3.5 Evaluating Performance of a Model
3.5.1 Supervised learning – classification
3.5.2 Supervised learning – regression
3.5.3 Unsupervised learning – clustering
3.6 Improving Performance of a Model
3.7 Summary
4 Basics of Feature Engineering
4.1 Introduction
4.1.1 What is a feature?
4.1.2 What is feature engineering?
4.2 Feature Transformation
4.2.1 Feature construction
4.2.2 Feature extraction
4.3 Feature Subset Selection
4.3.1 Issues in high-dimensional data
4.3.2 Key drivers of feature selection – feature relevance and redundancy
4.3.3 Measures of feature relevance and redundancy
4.3.4 Overall feature selection process
4.3.5 Feature selection approaches
4.4 Summary
5 Brief Overview of Probability
5.1 Introduction
5.2 Importance of Statistical Tools in Machine Learning
5.3 Concept of Probability – Frequentist and Bayesian Interpretation
5.3.1 A brief review of probability theory
5.4 Random Variables
5.4.1 Discrete random variables
5.4.2 Continuous random variables
5.5 Some Common Discrete Distributions
5.5.1 Bernoulli distribution
5.5.2 Binomial distribution
5.5.3 The multinomial and multinoulli distributions
5.5.4 Poisson distribution
5.6 Some Common Continuous Distributions
5.6.1 Uniform distribution
5.6.2 Gaussian (normal) distribution
5.6.3 The Laplace distribution
5.7 Multiple Random Variables
5.7.1 Bivariate random variables
5.7.2 Joint distribution functions
5.7.3 Joint probability mass functions
5.7.4 Joint probability density functions
5.7.5 Conditional distributions
5.7.6 Covariance and correlation
5.8 Central Limit Theorem
5.9 Sampling Distributions
5.9.1 Sampling with replacement
5.9.2 Sampling without replacement
5.9.3 Mean and variance of sample
5.10 Hypothesis Testing
5.11 Monte Carlo Approximation
5.12 Summary
6 Bayesian Concept Learning
6.1 Introduction
6.2 Why Are Bayesian Methods Important?
6.3 Bayes' Theorem
6.3.1 Prior
6.3.2 Posterior
6.3.3 Likelihood
6.4 Bayes' Theorem and Concept Learning
6.4.1 Brute-force Bayesian algorithm
6.4.2 Concept of consistent learners
6.4.3 Bayes optimal classifier
6.4.4 Naïve Bayes classifier
6.4.5 Applications of Naïve Bayes classifier
6.4.6 Handling continuous numeric features in Naïve Bayes classifier
6.5 Bayesian Belief Network
6.5.1 Independence and conditional independence
6.5.2 Use of the Bayesian Belief network in machine learning
6.6 Summary
7 Supervised Learning: Classification
7.1 Introduction
7.2 Example of Supervised Learning
7.3 Classification Model
7.4 Classification Learning Steps
7.5 Common Classification Algorithms
7.5.1 k-Nearest Neighbour (kNN)
7.5.2 Decision tree
7.5.3 Random forest model
7.5.4 Support vector machines
7.6 Summary
8 Supervised Learning: Regression
8.1 Introduction
8.2 Example of Regression
8.3 Common Regression Algorithms
8.3.1 Simple linear regression
8.3.2 Multiple linear regression
8.3.3 Assumptions in regression analysis
8.3.4 Main problems in regression analysis
8.3.5 Improving accuracy of the linear regression model
8.3.6 Polynomial regression model
8.3.7 Logistic regression
8.3.8 Maximum likelihood estimation
8.4 Summary
9 Unsupervised Learning
9.1 Introduction
9.2 Unsupervised vs Supervised Learning
9.3 Application of Unsupervised Learning
9.4 Clustering
9.4.1 Clustering as a machine learning task
9.4.2 Different types of clustering techniques
9.4.3 Partitioning methods
9.4.4 K-Medoids: a representative object-based technique
9.4.5 Hierarchical clustering
9.4.6 Density-based methods – DBSCAN
9.5 Finding Pattern using Association Rule
9.5.1 Definition of common terms
9.5.2 Association rule
9.5.3 The apriori algorithm for association rule learning
9.5.4 Build the apriori principle rules
9.6 Summary
10 Basics of Neural Network
10.1 Introduction
10.2 Understanding the Biological Neuron
10.3 Exploring the Artificial Neuron
10.4 Types of Activation Functions
10.4.1 Identity function
10.4.2 Threshold/step function
10.4.3 ReLU (Rectified Linear Unit) function
10.4.4 Sigmoid function
10.4.5 Hyperbolic tangent function
10.5 Early Implementations of ANN
10.5.1 McCulloch–Pitts model of neuron
10.5.2 Rosenblatt's perceptron
10.5.3 ADALINE network model
10.6 Architectures of Neural Network
10.6.1 Single-layer feed forward network
10.6.2 Multi-layer feed forward ANNs
10.6.3 Competitive network
10.6.4 Recurrent network
10.7 Learning Process in ANN
10.7.1 Number of layers
10.7.2 Direction of signal flow
10.7.3 Number of nodes in layers
10.7.4 Weight of interconnection between neurons
10.8 Backpropagation
10.9 Deep Learning
10.10 Summary
11 Other Types of Learning
11.1 Introduction
11.2 Representation Learning
11.2.1 Supervised neural networks and multilayer perceptron
11.2.2 Independent component analysis (unsupervised)
11.2.3 Autoencoders
11.2.4 Various forms of clustering
11.3 Active Learning
11.3.1 Heuristics for active learning
11.3.2 Active learning query strategies
11.4 Instance-Based Learning (Memory-based Learning)
11.4.1 Radial basis function
11.4.2 Pros and cons of instance-based learning method
11.5 Association Rule Learning Algorithm
11.5.1 Apriori algorithm
11.5.2 Eclat algorithm
11.6 Ensemble Learning Algorithm
11.6.1 Bootstrap aggregation (bagging)
11.6.2 Boosting
11.6.3 Gradient boosting machines (GBM)
11.7 Regularization Algorithm
11.8 Summary
Appendix A: Programming Machine Learning in R
Appendix B: Programming Machine Learning in Python
Appendix C: A Case Study on Machine Learning Application: Grouping Similar Service Requests and Classifying a New One
Model Question Paper-1
Model Question Paper-2
Model Question Paper-3
Index

Preface

Repeated requests from Computer Science and IT engineering students who are the readers of our previous books encouraged us to write this book on machine learning.
The concept of machine learning and the huge potential of its application is still niche knowledge, not yet widespread among the student community. So we thought of writing this book specifically for techies, college students, and junior managers, so that they can understand machine learning concepts easily. They should not only use machine learning software packages but also understand the concepts behind those packages.

The application of machine learning is expanding day by day. From recommending products to buyers, to predicting the future real estate market, to helping medical practitioners in diagnosis, to optimizing energy consumption and thus helping the cause of a greener Earth, machine learning is finding its utility in every sphere of life.

Due care was taken to write this book in simple English and to present machine learning concepts in an easily understandable way, so that it can be used as a textbook for both graduate and advanced undergraduate classes in machine learning, or as a reference text.

Who Is This Book For?

Readers of this book will gain a thorough understanding of machine learning concepts. Not only students but also software professionals will find in this book a variety of techniques, with sufficient discussion, that cater to the needs of professional environments. Technical managers will get an insight into weaving machine learning into the overall software engineering process. Students, developers, and technical managers with a basic background in computer science will find the material in this book easily readable.

How This Book Is Organized

Each chapter starts with an introductory paragraph, which gives an overview of the chapter, along with a chapter-coverage table listing the topics covered in the chapter. Sample questions at the end of each chapter help students prepare for examinations.
Discussion points and Points to Ponder given inside the chapters help to clarify the concepts and also stretch the thinking ability of students and professionals. A recap summary is given at the end of each chapter for a quick review of the topics. Throughout this book, you will see many exercises and discussion questions. Please don't skip these – they are important for understanding machine learning concepts fully.

This book starts with an introduction to machine learning, which lays the theoretical foundation for the remaining chapters. Modelling, feature engineering, and basic probability are covered in early chapters, before entering the world of machine learning proper, which helps the reader grasp machine learning concepts more easily later on. As a bonus, machine learning exercises with multiple examples are discussed in the machine learning languages R and Python: Appendix A discusses Programming Machine Learning in R and Appendix B discusses Programming Machine Learning in Python.

Acknowledgements

We are grateful to Pearson Education, who came forward to publish this book. Ms. Neha Goomer and Mr. Purushothaman Chandrasekaran of Pearson Education were always kind and understanding. Mr. Purushothaman reviewed this book with abundant patience and helped us say what we had wanted to, improving each and every page of this book with care. Their patience and guidance were invaluable. Thank you very much, Neha and Puru.

–All authors

The journey through traditional Project Management, Agile Project Management, and Program and Portfolio Management, along with the use of artificial intelligence and machine learning in the field, has been very rewarding, as it has given us the opportunity to work for some of the best organizations in the world and learn from some of the best minds. Along the way, several individuals have been instrumental in providing us with the guidance, opportunities and insights we needed to excel in this field.
We wish to personally thank Mr. Rajesh Balaji Ramachandran, Senior Vice-president, Cognizant Technology Solutions; Mr. Pradeep Shilige, Executive Vice-president, Cognizant Technology Solutions; Mr. Alexis Samuel, Senior Vice-president, Cognizant Technology Solutions; Mr. Hariharan Mathrubutham, Vice-president, Cognizant Technology Solutions; and Mr. Krishna Prasad Yerramilli, Assistant Vice-president, Cognizant Technology Solutions, for their inspiration and help in creating this book. They have immensely contributed to improving our skills.

–Saikat and Chandramouli

This book wouldn't have been possible without the constant inspiration and support of my lovely wife, Adity. My parents have always been enthusiastic about this project and provided me continuous guidance at every necessary step. The unconditional love and affection of my sons, Deepro and Devarko, constantly provided me the motivation to work hard on this crucial project.

This book is the culmination of all the learning that I gained from many highly reputed professionals in the industry. I was fortunate to work with them and gain knowledge from them, which helped me in molding my professional career. Prof. Indranil Bose and Prof. Bodhibrata Nag from IIM Calcutta guided me enormously in different aspects of life and career. My heartfelt thanks go to all the wonderful people who contributed in many ways to conceptualizing this book and wished it success.

–Saikat Dutt

This book is the result of all the learning I have gained from many highly reputed professionals in the industry. I was fortunate to work with them and, in the process, acquire knowledge that helped me in molding my professional career. I thank Mr. Chandra Sekaran, Group Chief Executive, Tech and Ops, Cognizant, for his continuous encouragement and unabated passion for strategic value creation; his advice was invaluable while working on this book. I am obliged to Mr.
Chandrasekar, former CIO, Standard Chartered Bank, for demonstrating how the lives, thoughts and feelings of others in professional life are to be valued. He is a wonderful and cheerful man who inspired me and gave me a lot of encouragement when he launched my first book, Virtual Project Management Office. Ms. Meena Karthikeyan, Vice-president, Cognizant Technology Solutions; Ms. Kalyani Sekhar, Assistant Vice-president, Cognizant Technology Solutions; and Mr. Balasubramanian Narayanan, Senior Director, Cognizant Technology Solutions, guided me enormously in different aspects of professional life.

My parents (Mr. Subramanian and Ms. Lalitha) have always been enthusiastic about this project. Their unconditional love and affection provided the much-needed moral support. My son, Shri Krishna, and daughter, Shri Siva Ranjani, constantly added impetus to my motivation to work hard. This book would not have been possible without the constant inspiration and support of my wife, Ramya. She was unfailingly supportive and encouraging during the long months that I spent glued to my laptop while writing this book. Last but not the least, I beg forgiveness of all those who have been with me over the course of the years and whose names I have failed to mention here.

–S. Chandramouli

First of all, I would like to thank the Almighty for everything. It is the constant support and encouragement from my family that made it possible for me to put my heart and soul into my first authoring venture. My parents have always been my role model, my wife a source of constant strength, and my daughters my happiness. Without my family, I wouldn't have had the luxury of time to spend hours writing this book. Any amount of thanks would fall short of expressing my gratitude and the pleasure of being part of the family. I would also like to thank the duo who have been my academic collaborators, mentors and brothers – Dr. Saptarsi Goswami and Mr. Mrityunjoy Panday – without whom I would be so incomplete.
My deep respect goes to my research guides and mentors, Amlan sir and Basabi madam, for the knowledge and support that I have been privileged to receive from them. My sincere gratitude to my mentors in my past organization, Cognizant Technology Solutions – Mr. Rajarshi Chatterjee, Mr. Manoj Paul, Mr. Debapriya Dasgupta and Mr. Balaji Venkatesan – for the valuable learning that I received from them. I thank all my colleagues at the Institute of Engineering & Management, who have given me relentless support and encouragement. Last, but not the least, my students, who are always a source of inspiration for me, need special mention. I am especially indebted to Goutam Bose, Sayan Bachhar and Piyush Nandi for their extreme support in reviewing the chapters and providing invaluable feedback to make them more student-friendly. Also, thanks to Gulshan, Sagarika, Samyak, Sahil, Priya, Nibhash, Attri, Arunima, Salman, Deepayan, Arnab, Swapnil, Gitesh, Ranajit, Ankur, Sudipta and Debankan for their great support.

–Amit Kumar Das

About the Authors

Saikat Dutt, PMP, PMI-ACP, CSM, is the author of three books published by Pearson: 'PMI Agile Certified Practitioner – Excel with Ease', 'Software Engineering' and 'Software Project Management'. Two of these – 'Software Project Management' and 'PMI Agile Certified Practitioner – Excel with Ease' – are textbooks at IIM Calcutta for the PGDBA class. Saikat works on AI and machine learning projects, especially focusing on their application in managing software projects. He is a regular speaker on machine learning topics and is involved in reviewing machine learning and AI related papers. Saikat is a 'Project Management Professional (PMP)' and 'PMI Agile Certified Practitioner' certified by the Project Management Institute (PMI), USA, and a Certified Scrum Master (CSM). He has more than nineteen years of IT industry experience and has expertise in managing large-scale, multi-location and mission-critical projects. Saikat holds a B.E.
degree from Jadavpur University, Kolkata, and at present works as a Director in Cognizant Technology Solutions. He has been a guest lecturer at IIM Calcutta since 2016 for the Software Project Management course in the PGDBA class. Saikat is also an active speaker on Agile, project management, machine learning and other recent topics at several forums. He is actively working with IIM Calcutta to develop management case studies, which are taught in global business schools.

S. Chandramouli, PMP, PMI-ACP, is an alumnus of the Indian Institute of Management Kozhikode (IIM-K). He is a Certified Global Business Leader from Harvard Business School. He is a prolific writer of business management articles dealing with delivery management, competitiveness, IT, organizational culture and leadership. He has authored books published by Pearson that have been recognized as reference books in various universities; titles include 'PMI Agile Certified Practitioner – Excel with Ease', 'Software Engineering' and 'Software Project Management'. In addition, he has edited two books by foreign authors to suit the needs of Indian universities: 'Applying UML and Patterns' by Craig Larman and 'Design Patterns' by Gamma et al. (the Gang of Four). He is a certified Six Sigma 'Green Belt' and a certified master practitioner in Neuro-Linguistic Programming (NLP).

Chandramouli has a good record of delivering large-scale, mission-critical projects, including AI and machine learning projects, on time and within budget to the customer's satisfaction. He was an active member in PMI's Organizational Project Management Maturity Model (OPM3) and Project Management Competency Development Framework (PMCDF) assignments. He has been an invited speaker at various technology and management conferences and has addressed more than 7,000 software professionals worldwide on a variety of themes associated with AI, machine learning, delivery management, competitiveness and leadership.
Amit Kumar Das is a seasoned industry practitioner turned full-time academician. He is currently working as an Assistant Professor at the Institute of Engineering & Management (IEM). He is also a guest teacher in the Department of Radiophysics and Electronics, University of Calcutta. Before joining academics, he was a Director in the Analytics and Information Management practice at Cognizant Technology Solutions. Amit has spent more than 18 years in the IT industry in diverse roles, working with stakeholders across the globe.

Amit completed his Bachelor of Engineering at the Indian Institute of Engineering Science and Technology (IIEST), Shibpur, and his Master of Technology at the Birla Institute of Technology and Science (BITS), Pilani. He is currently pursuing his research at the University of Calcutta. Amit's areas of research include machine learning and deep learning. He has published many research papers in the area of data analytics and machine learning in refereed international journals and conferences. He has also been a regular speaker in the areas of software engineering, data analytics and machine learning.
Model Syllabus for Machine Learning

Credits: 5
Contacts per week: 3 lectures + 1 tutorial

MODULE I
Introduction to Machine Learning: Human learning and its types; machine learning and its types; well-posed learning problem; applications of machine learning; issues in machine learning
Preparing to Model: Basic data types; exploring numerical data; exploring categorical data; exploring relationship between variables; data issues and remediation; data pre-processing
Modelling and Evaluation: Selecting a model; training a model – holdout, k-fold cross-validation, bootstrap sampling; model representation and interpretability – under-fitting, over-fitting, bias-variance trade-off; model performance evaluation – classification, regression, clustering; performance improvement
Feature Engineering: Feature construction; feature extraction; feature selection
[12 Lectures]

MODULE II
Brief Review of Probability: Basic concepts of probability, random variables; discrete distributions – binomial, Poisson, Bernoulli, etc.; continuous distributions – uniform, normal, Laplace; central limit theorem; Monte Carlo approximation
Bayesian Concept Learning: Bayes' theorem – prior and posterior probability, likelihood; concept learning; Bayesian Belief Network
[6 Lectures]

MODULE III
Supervised Learning – Classification: Basics of supervised learning – classification; k-nearest neighbour; decision tree; random forest; support vector machine
Supervised Learning – Regression: Simple linear regression; other regression techniques
Unsupervised Learning: Basics of unsupervised learning; clustering techniques; association rules
[10 Lectures]

MODULE IV
Basics of Neural Network: Understanding biological and artificial neurons; types of activation functions; early implementations – McCulloch–Pitts, Rosenblatt's perceptron, ADALINE; architectures of neural network; learning process in ANN; backpropagation
Other Types of Learning: Representation learning; active learning; instance-based learning;
association rule learning; ensemble learning; regularization
Machine Learning Live Case Studies
[8 Lectures]

Lesson Plan

Chapter 1
Introduction to Machine Learning

OBJECTIVE OF THE CHAPTER:

The objective of this chapter is to venture into the arena of machine learning. Newcomers struggle a lot to understand the philosophy of machine learning. Also, they do not know where to start and which problems could, and should, be solved using machine learning tools and techniques. This chapter intends to give newcomers a starting point for their journey into machine learning. It begins with a historical tour of the field and then gives a glimpse of modern-day applications.

1.1 INTRODUCTION

It has been more than 20 years since a computer program defeated the reigning world champion in a game which is considered to need a lot of intelligence to play. The computer program was IBM's Deep Blue, and it defeated world chess champion Garry Kasparov. That was probably when the largest number of people first gave serious attention to a fast-evolving field of computer science, or more specifically of artificial intelligence: machine learning (ML).

As of today, machine learning is a mature technology area finding application in almost every sphere of life. It can recommend toys to toddlers much in the same way as it can suggest a technology book to a geek or a rich title in literature to a writer. It predicts the future market to help amateur traders compete with seasoned stock traders. It helps an oncologist find whether a tumour is malignant or benign. It helps in optimizing energy consumption, thus helping the cause of a greener Earth. Google has become one of the front-runners, focusing much of its research on machine learning and artificial intelligence – the Google self-driving car and Google Brain being two of its most ambitious projects in its journey of innovation in the field of machine learning.
In a nutshell, machine learning has become a way of life, no matter which sphere of life we look at closely. But where did it all start?

The foundation of machine learning was laid in the 18th and 19th centuries. The first related work dates back to 1763. In that year, Thomas Bayes's work 'An Essay towards solving a Problem in the Doctrine of Chances' was published, two years after his death. This is the work underlying Bayes' theorem, a fundamental result on which a number of machine learning algorithms are based. In 1812, Bayes' theorem was formalized by the French mathematician Pierre-Simon Laplace. The method of least squares, the foundational concept for solving regression problems, was formalized in 1805. In 1913, Andrey Markov came up with the concept of Markov chains.

However, the real start of focused work in the field of machine learning is considered to be Alan Turing's seminal work in 1950. In his paper 'Computing Machinery and Intelligence' (Mind, New Series, Vol. 59, No. 236, Oct. 1950, pp. 433–460), Turing posed the question 'Can machines think?' or, in other words, 'Do machines have intelligence?'. He was the first to propose that machines can 'learn' and become artificially intelligent. In 1952, Arthur Samuel of the IBM laboratory started working on machine learning programs, and first developed programs that could play checkers. In 1957, Frank Rosenblatt designed the first neural network program, simulating the human brain. From then on, for the next 50 years, the journey of machine learning has been fascinating. A number of machine learning algorithms were formulated by different researchers, e.g. the nearest neighbour algorithm in 1969, the recurrent neural network in 1982, and support vector machines and the random forest algorithm in 1995. The latest feather in the cap of machine learning development has been Google's AlphaGo program, which beat a professional human Go player using machine learning techniques.
Points to Ponder

While Deep Blue was searching some 200 million positions per second, Kasparov was probably searching no more than 5–10 positions per second. Yet he played at almost the same level. This clearly shows that humans have some trick up their sleeve that computers have not yet mastered.

Go is a board game played by two players. It was invented in China almost 2,500 years ago and is considered to be the oldest board game. Though it has relatively simple rules, Go is a very complex game (more complex than chess) because of its larger board size and greater number of possible moves.

The evolution of machine learning from 1950 is depicted in Figure 1.1. The rapid development in the area of machine learning has triggered a question in everyone's mind: can machines learn better than humans? To find the answer, the first step is to understand what learning is from a human perspective. Then, more light can be shed on what machine learning is. In the end, we need to know whether machine learning has already surpassed, or has the potential to surpass, human learning in every facet of life.

FIG. 1.1 Evolution of machine learning

1.2 WHAT IS HUMAN LEARNING?

In cognitive science, learning is typically referred to as the process of gaining information through observation. And why do we need to learn? In our daily life, we need to carry out multiple activities. It may be a task as simple as walking down the street or doing homework. Or it may be a complex task like deciding the angle at which a rocket should be launched so that it follows a particular trajectory. To do a task properly, we need prior information on one or more things related to the task. Also, as we keep learning more, or in other words acquiring more information, our efficiency in doing these tasks keeps improving. For example, with more knowledge, the ability to do homework with fewer mistakes increases.
In the same way, information from past rocket launches helps in taking the right precautions and makes future launches more successful. Thus, with more learning, tasks can be performed more efficiently.

1.3 TYPES OF HUMAN LEARNING

Thinking intuitively, human learning happens in one of three ways: (1) somebody who is an expert in the subject directly teaches us, (2) we build our own notion indirectly, based on what we have learnt from experts in the past, or (3) we learn by ourselves, maybe after multiple attempts, some being unsuccessful. The first type of learning falls under the category of learning directly under expert guidance, the second type falls under learning guided by knowledge gained from experts, and the third type is learning by self, or self-learning. Let's look at each of these types in depth using real-life examples and try to understand what they mean.

1.3.1 Learning under expert guidance

An infant may inculcate certain traits and characteristics, learning straight from its guardians. He calls his hand a 'hand' because that is the information he gets from his parents. The sky is 'blue' to him because that is what his parents have taught him. We say that the baby 'learns' things from his parents.

The next phase of life is when the baby starts going to school. In school, he starts with basic familiarization with the alphabet and digits. Then the baby learns how to form words from letters and numbers from digits. Slowly, more complex learning happens in the form of sentences, paragraphs, complex mathematics, science, etc. The baby is able to learn all these things from his teacher, who already has knowledge in these areas.

Then come higher studies, where the person learns about more complex, application-oriented skills. Engineering students get skilled in one of the disciplines like civil, computer science, electrical, or mechanical; medical students learn about anatomy, physiology, pharmacology, etc.
There are experts, in general the teachers, in the respective fields who have in-depth subject matter knowledge and who help the students in learning these skills. Then the person starts working as a professional in some field. Though he might have gone through enough theoretical learning in that field, he still needs to learn more about the hands-on application of the knowledge he has acquired. The professional mentors, by virtue of the knowledge they have gained through years of hands-on experience, help all newcomers in the field learn on the job. In all phases of the life of a human being, there is an element of guided learning. This learning is imparted by someone purely because he/she has already gathered the knowledge by virtue of his/her experience in that field. So guided learning is the process of gaining information from a person having sufficient knowledge due to past experience. 1.3.2 Learning guided by knowledge gained from experts An essential part of learning also happens with the knowledge which has been imparted by a teacher or mentor at some point of time in some other form/context. For example, a baby can group together all objects of the same colour even if his parents have not specifically taught him to do so. He is able to do so because at some point of time or other his parents have told him which colour is blue, which is red, which is green, etc. A grown-up kid can pick the one odd word from a set of words because it is a verb while the other words are all nouns. He can do this because of his ability to label words as verbs or nouns, taught by his English teacher long back. In a professional role, a person is able to make out which customers he should target in a marketing campaign from the knowledge about customer preference that was given by his boss long back. In all these situations, there is no direct learning. It is some past information, shared in some different context, which is used as learning to make decisions. 
1.3.3 Learning by self In many situations, humans are left to learn on their own. A classic example is a baby learning to walk through obstacles. He bumps into obstacles and falls down multiple times till he learns that whenever there is an obstacle, he needs to cross over it. He faces the same challenge while learning to ride a cycle as a kid or drive a car as an adult. Not all things are taught by others. A lot of things need to be learnt from mistakes made in the past. We tend to form a checklist of things that we should do and things that we should not do, based on our experiences. 1.4 WHAT IS MACHINE LEARNING? Before answering the question ‘What is machine learning?’, more fundamental questions that come to one’s mind are: Do machines really learn? If so, how do they learn? Which problem can we consider as a well-posed learning problem? What are the important features that are required to well-define a learning problem? At the outset, it is important to formalize the definition of machine learning. This will itself address the first question, i.e. whether machines really learn. There are multiple ways to define machine learning. But the one which is perhaps most relevant, concise and universally accepted is the one stated by Tom M. Mitchell, Professor in the Machine Learning Department, School of Computer Science, Carnegie Mellon University. Tom M. Mitchell has defined machine learning as ‘A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.’ What this essentially means is that a machine can be considered to learn if it is able to gather experience by doing a certain task and improve its performance in doing similar tasks in the future. When we talk about past experience, it means past data related to the task. This data is an input to the machine from some source. 
In the context of learning to play checkers, E represents the experience of playing the game, T represents the task of playing checkers and P is the performance measure indicated by the percentage of games won by the player. The same mapping can be applied to any other machine learning problem, for example, an image classification problem. In the context of image classification, E represents the past data of images having labels or assigned classes (for example, whether the image is of class cat or class dog or class elephant, etc.), T is the task of assigning a class to new, unlabelled images and P is the performance measure indicated by the percentage of images correctly classified. The first step in any project is defining your problem. Even if the most powerful algorithm is used, the results will be meaningless if the wrong problem is solved. 1.4.1 How do machines learn? The basic machine learning process can be divided into three parts.
1. Data Input: Past data or information is utilized as a basis for future decision-making
2. Abstraction: The input data is represented in a broader way through the underlying algorithm
3. Generalization: The abstracted representation is generalized to form a framework for making decisions
Figure 1.2 is a schematic representation of the machine learning process. FIG. 1.2 Process of machine learning Let’s put things in the perspective of the human learning process and try to understand the machine learning process more clearly. The reason is that, in some sense, the machine learning process tries to emulate, to a large extent, the process in which humans learn. Let’s consider the typical process of learning from classroom and books and preparing for an examination. It is the tendency of many students to try and memorize (we often call it ‘learning by heart’) as many things as possible. This may work well when the scope of learning is not so vast. 
Also, when the kinds of questions asked in the examination are pretty simple and straightforward, the questions can be answered by simply writing the same things which have been memorized. However, as the scope gets broader and the questions asked in the examination get more complex, the strategy of memorizing doesn’t work well. The number of topics may get too vast for a student to memorize. Also, the capability of memorizing varies from student to student. Together with that, since the questions get more complex, a direct reproduction of the things memorized may not help. The situation continues to get worse as the student graduates to higher classes. So, what we see in the case of human learning is that just by rote memorizing and perfect recall, i.e. just based on knowledge input, students can do well in examinations only till a certain stage. Beyond that, a better learning strategy needs to be adopted: 1. to be able to deal with the vastness of the subject matter and the related issues in memorizing it 2. to be able to answer questions where a direct answer has not been learnt A good option is to figure out the key points or ideas amongst a vast pool of knowledge. This helps in creating an outline of topics and a conceptual mapping of those outlined topics with the entire knowledge pool. For example, a broad pool of knowledge may consist of all living animals and their characteristics such as whether they live on land or in water, whether they lay eggs, whether they have scales or fur or none, etc. It is a difficult task for any student to memorize the characteristics of all living animals – no matter how photographic a memory he/she may possess. It is better to draw a notion of the basic groups that all living animals belong to and the characteristics which define each of the basic groups. The basic groups of animals are invertebrates and vertebrates. Vertebrates are further grouped as mammals, reptiles, amphibians, fishes, and birds. 
Here, we have mapped animal groups and their salient characteristics.
1. Invertebrates: Do not have backbones and skeletons
2. Vertebrates:
   1. Fishes: Always live in water and lay eggs
   2. Amphibians: Semi-aquatic, i.e. may live in water or on land; smooth skin; lay eggs
   3. Reptiles: Semi-aquatic like amphibians; scaly skin; lay eggs; cold-blooded
   4. Birds: Can fly; lay eggs; warm-blooded
   5. Mammals: Have hair or fur; feed their young with milk; warm-blooded
This makes it easier to memorize, as the scope now reduces to knowing the animal groups that the animals belong to. The rest of the answers about the characteristics of the animals may be derived from the concept of mapping animal groups to their characteristics. Moving to the machine learning paradigm, the vast pool of knowledge is available from the data input. However, rather than using it in its entirety, a concept map, much in line with the animal group to characteristic mapping explained above, is drawn from the input data. This is nothing but knowledge abstraction as performed by the machine. In the end, the abstracted mapping from the input data can be applied to make critical conclusions. For example, if the group of an animal is given, its characteristics can be automatically inferred. Conversely, if the characteristics of an unknown animal are given, a definite conclusion can be made about the animal group it belongs to. This is generalization in the context of machine learning. 1.4.1.1 Abstraction During the machine learning process, knowledge is fed in the form of input data. However, the data cannot be used in its original shape and form. As we saw in the example above, abstraction helps in deriving a conceptual map based on the input data. This map, or a model as it is known in the machine learning paradigm, is a summarized knowledge representation of the raw data. 
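The abstraction-and-generalization idea above can be sketched in a few lines of Python. This is only an illustration: the group-to-characteristics map is a condensed version of the one in the text, and the best-overlap rule for inferring a group is a simplifying assumption, not a real learning algorithm.

```python
# Hedged sketch: abstraction as a concept map (group -> characteristics)
# and generalization as inference in either direction, following the
# animal example in the text. Trait sets are condensed for brevity.

# Abstraction: a summarized map instead of memorizing every animal
group_traits = {
    "fishes": {"live in water", "lay eggs"},
    "birds": {"can fly", "lay eggs", "warm-blooded"},
    "mammals": {"hair or fur", "feed young with milk", "warm-blooded"},
}

def traits_of(group):
    """Generalization, direction 1: group given -> derive characteristics."""
    return group_traits[group]

def group_of(observed_traits):
    """Generalization, direction 2: characteristics given -> infer the group
    whose known traits overlap most with the observation."""
    return max(group_traits, key=lambda g: len(group_traits[g] & observed_traits))

print(group_of({"can fly", "lay eggs"}))  # -> birds
```

Given the group, the characteristics follow directly; given characteristics of an unknown animal, the group with the largest trait overlap is inferred, mirroring the two directions of generalization described above.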
The model may be in any one of the following forms:
Computational blocks like if/else rules
Mathematical equations
Specific data structures like trees or graphs
Logical groupings of similar observations
The choice of the model used to solve a specific learning problem is a human task. The decision related to the choice of model is taken based on multiple aspects, some of which are listed below:
The type of problem to be solved: Whether the problem is related to forecast or prediction, analysis of trend, understanding the different segments or groups of objects, etc.
Nature of the input data: How exhaustive the input data is, whether the data has missing values for many fields, the data types, etc.
Domain of the problem: Whether the problem is in a business-critical domain with a high rate of data input and a need for immediate inference, e.g. the fraud detection problem in the banking domain.
Once the model is chosen, the next task is to fit the model based on the input data. Let’s understand this with an example. In a case where the model is represented by a mathematical equation, say ‘y = c1 + c2x’ (this model is known as simple linear regression, which we will study in a later chapter), based on the input data we have to find out the values of c1 and c2. Otherwise, the equation (or the model) is of no use. So, fitting the model, in this case, means finding the values of the unknown coefficients or constants of the equation or the model. This process of fitting the model based on the input data is known as training. Also, the input data based on which the model is being finalized is known as training data. 1.4.1.2 Generalization The first part of the machine learning process is abstraction, i.e. abstracting the knowledge which comes as input data into the form of a model. However, this abstraction process, or more popularly training the model, is just one part of machine learning. The other key part is to tune up the abstracted knowledge to a form which can be used to take future decisions. 
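The fitting step described above can be sketched concretely. The following is a minimal, standard-library-only illustration of finding c1 and c2 for the model y = c1 + c2x by the least squares method; the five data points are made up purely so that the fit has an obvious answer.

```python
# Hedged sketch: "training" the simple linear regression model y = c1 + c2*x
# by the closed-form least squares method. Data points are hypothetical.

def fit_simple_linear_regression(xs, ys):
    """Return (c1, c2) minimizing the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least squares estimate of the slope c2
    c2 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
         sum((x - mean_x) ** 2 for x in xs)
    # Intercept c1 makes the fitted line pass through the mean point
    c1 = mean_y - c2 * mean_x
    return c1, c2

# Training data lying exactly on y = 1 + 2x, so fitting recovers c1=1, c2=2
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
c1, c2 = fit_simple_linear_regression(xs, ys)
print(c1, c2)  # -> 1.0 2.0
```

Here the "training data" are the (x, y) pairs and "training" is nothing more than solving for the unknown coefficients, exactly as the paragraph above describes.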
This is achieved as a part of generalization. This part is quite difficult to achieve. This is because the model is trained based on a finite set of data, which may possess a limited set of characteristics. But when we want to apply the model to take a decision on a set of unknown data, usually termed as test data, we may encounter two problems: 1. The trained model is aligned too closely with the training data, hence may not portray the actual trend. 2. The test data possesses certain characteristics apparently unknown to the training data. Hence, a precise approach to decision-making will not work. An approximate or heuristic approach, much like gut-feeling-based decision-making in human beings, has to be adopted. This approach has the risk of not making a correct decision – quite obviously because certain assumptions that are made may not be true in reality. But just like machines, the same mistakes can be made by humans too when a decision is made based on intuition or gut feeling – in a situation where exact reason-based decision-making is not possible. 1.4.2 Well-posed learning problem For defining a new problem which can be solved using machine learning, a simple framework, highlighted below, can be used. This framework also helps in deciding whether the problem is a right candidate to be solved using machine learning. The framework involves answering three questions: 1. What is the problem? 2. Why does the problem need to be solved? 3. How to solve the problem? Step 1: What is the Problem? Several pieces of information should be collected to understand what the problem is. Informal description of the problem, e.g. I need a program that will prompt the next word as and when I type a word. Formalism Use Tom Mitchell’s machine learning formalism stated above to define the T, P, and E for the problem. For example: Task (T): Prompt the next word when I type a word. Experience (E): A corpus of commonly used English words and phrases. 
Performance (P): The number of correct words prompted, considered as a percentage (which in the machine learning paradigm is known as learning accuracy). Assumptions – Create a list of assumptions about the problem. Similar problems What other problems have you seen, or can you think of, that are similar to the problem you are trying to solve? Step 2: Why does the problem need to be solved? Motivation What is the motivation for solving the problem? What requirement will it fulfil? For example, does this problem solve any long-standing business issue like finding out potentially fraudulent transactions? Or is the purpose more trivial, like trying to suggest some movies for the upcoming weekend? Solution benefits Consider the benefits of solving the problem. What capabilities does it enable? It is important to clearly understand the benefits of solving the problem. These benefits can be articulated to sell the project. Solution use How will the solution to the problem be used, and what lifetime is the solution expected to have? Step 3: How would I solve the problem? Try to explore how to solve the problem manually. Detail out step-by-step data collection, data preparation, and program design to solve the problem. Collect all these details and update the previous sections of the problem definition, especially the assumptions. Summary Step 1: What is the problem? Describe the problem informally and formally and list assumptions and similar problems. Step 2: Why does the problem need to be solved? List the motivation for solving the problem, the benefits that the solution will provide and how the solution will be used. Step 3: How would I solve the problem? Describe how the problem would be solved manually to flush out domain knowledge. Did you know? Sony created a series of robotic pets called Aibo. It was built in 1998. Although most models sold were dog-like, other inspirations included lion cubs. It could express emotions and could also recognize its owner. 
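The T/P/E formalism for the next-word prompting example can be sketched with a toy frequency model. Everything here is hypothetical: the "corpus" is a made-up handful of phrases and the model is a simple most-frequent-next-word lookup, not a realistic language model.

```python
# Hedged sketch of the T/P/E framing for next-word prompting,
# using a toy bigram-frequency model over a hypothetical corpus.
from collections import Counter, defaultdict

# Experience (E): a tiny made-up corpus of common phrases
corpus = "good morning everyone good morning everyone good night all".split()

# Abstraction: count which word most often follows each word
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def prompt_next(word):
    """Task (T): prompt the most frequent next word, or None if unseen."""
    counts = following.get(word)
    return counts.most_common(1)[0][0] if counts else None

# Performance (P): percentage of correct prompts on a few test pairs
pairs = [("good", "morning"), ("morning", "everyone")]
correct = sum(prompt_next(w) == expected for w, expected in pairs)
accuracy = 100 * correct / len(pairs)
print(accuracy)  # -> 100.0
```

The percentage computed at the end is exactly the learning accuracy P described above; with more experience E (a larger corpus), the counts, and hence P, would be expected to improve.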
In 2006, Aibo was added to Carnegie Mellon University’s ‘Robot Hall of Fame’. A new generation of Aibo was launched in Japan in January 2018. 1.5 TYPES OF MACHINE LEARNING As highlighted in Figure 1.3, machine learning can be classified into three broad categories: 1. Supervised learning – Also called predictive learning. A machine predicts the class of unknown objects based on prior class-related information of similar objects. 2. Unsupervised learning – Also called descriptive learning. A machine finds patterns in unknown objects by grouping similar objects together. 3. Reinforcement learning – A machine learns to act on its own to achieve the given goals. FIG. 1.3 Types of machine learning Did you know? Many video games are based on an artificial intelligence technique called the expert system. This technique can imitate areas of human behaviour, with the goal of mimicking the human abilities of sensing, perception, and reasoning. 1.5.1 Supervised learning The major motivation of supervised learning is to learn from past information. So what kind of past information does the machine need for supervised learning? It is information about the task which the machine has to execute. In the context of the definition of machine learning, this past information is the experience. Let’s try to understand it with an example. Say a machine is getting images of different objects as input and the task is to segregate the images by either the shape or the colour of the object. If it is by shape, images of round-shaped objects need to be separated from images of triangular-shaped objects, etc. If the segregation needs to happen based on colour, images of blue objects need to be separated from images of green objects. But how can the machine know what a round shape or a triangular shape is? In the same way, how can the machine distinguish an image of an object based on whether it is blue or green in colour? 
A machine is very much like a little child whose parents or adults need to guide him with basic information on shape and colour before he can start doing the task. A machine needs this basic information to be provided to it. This basic input, or the experience in the paradigm of machine learning, is given in the form of training data. Training data is the past information on a specific task. In the context of the image segregation problem, training data will have past data on different aspects or features of a number of images, along with a tag on whether the image is round or triangular, or blue or green in colour. The tag is called a ‘label’ and we say that the training data is labelled in the case of supervised learning. Figure 1.4 is a simple depiction of the supervised learning process. Labelled training data containing past information comes as an input. Based on the training data, the machine builds a predictive model that can be used on test data to assign a label for each record in the test data. FIG. 1.4 Supervised learning Some examples of supervised learning are Predicting the results of a game Predicting whether a tumour is malignant or benign Predicting the price of domains like real estate, stocks, etc. Classifying texts, such as classifying a set of emails as spam or non-spam Now, let’s consider two of the above examples, say ‘predicting whether a tumour is malignant or benign’ and ‘predicting the price of domains such as real estate’. Are these two problems the same in nature? The answer is ‘no’. Though both of them are prediction problems, in one case we are trying to predict which category or class an unknown data instance belongs to, whereas in the other case we are trying to predict an absolute value and not a class. When we are trying to predict a categorical or nominal variable, the problem is known as a classification problem, whereas when we are trying to predict a real-valued variable, the problem falls under the category of regression. 
Note: Supervised machine learning is only as good as the data used to train it. If the training data is of poor quality, the prediction will also be far from precise. Let’s try to understand these two areas of supervised learning, i.e. classification and regression, in more detail. 1.5.1.1 Classification Let’s discuss how to segregate the images of objects based on shape. If the image is of a round object, it is put under one category, while if the image is of a triangular object, it is put under another category. The category in which the machine should put an image of unknown category, also called test data in machine learning parlance, depends on the information it gets from the past data, which we have called training data. Since the training data has a label or category defined for each and every image, the machine has to map a new image, or test data, to the set of images to which it is similar and assign the same label or category to the test data. So we observe that the whole problem revolves around assigning a label or category or class to test data based on the label or category or class information imparted by the training data. Since the target objective is to assign a class label, this type of problem is known as a classification problem. Figure 1.5 depicts the typical process of classification. FIG. 1.5 Classification There are a number of popular machine learning algorithms which help in solving classification problems. To name a few, Naïve Bayes, Decision tree, and k-Nearest Neighbour algorithms are adopted by many machine learning practitioners. A critical classification problem in the context of the banking domain is identifying potentially fraudulent transactions. Since there are millions of transactions which have to be scrutinized to identify whether they might be fraudulent, it is not possible for any human being to carry out this task. Machine learning is effectively leveraged to do this task, and this is a classic case of classification. 
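The k-Nearest Neighbour idea mentioned above can be illustrated with a minimal standard-library sketch. The feature vectors and the "round"/"triangular" labels below are made up for illustration; a real application would typically use a library such as scikit-learn rather than hand-rolled code.

```python
# Hedged sketch of k-Nearest Neighbour classification (k = 3),
# standard library only; the labelled training data is hypothetical.
from collections import Counter
from math import dist

# Labelled training data: (feature vector, class label)
training_data = [
    ((1.0, 1.2), "round"),
    ((0.8, 1.0), "round"),
    ((1.1, 0.9), "round"),
    ((3.0, 3.2), "triangular"),
    ((3.1, 2.9), "triangular"),
    ((2.8, 3.0), "triangular"),
]

def classify(test_point, k=3):
    """Assign the majority label among the k nearest training points."""
    neighbours = sorted(training_data, key=lambda item: dist(item[0], test_point))
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]

print(classify((1.0, 1.0)))  # -> round
print(classify((3.0, 3.0)))  # -> triangular
```

The test data point simply receives the label shared by the training points it is most similar to, which is exactly the mapping described in the paragraph above.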
Based on the past transaction data, specifically the ones labelled as fraudulent, all new incoming transactions are marked or labelled as normal or suspicious. The suspicious transactions are subsequently segregated for a closer review. In summary, classification is a type of supervised learning where a target feature, which is of categorical type, is predicted for test data based on the information imparted by the training data. The target categorical feature is known as the class. Some typical classification problems include: Image classification Prediction of disease Win–loss prediction of games Prediction of natural calamities like earthquakes, floods, etc. Recognition of handwriting Did you know? Machine learning saves lives – ML can spot 52% of breast cancer cells a year before patients are diagnosed. The US Postal Service uses machine learning for handwriting recognition. Facebook’s news feed uses machine learning to personalize each member’s feed. 1.5.1.2 Regression In linear regression, the objective is to predict numerical features like real estate or stock price, temperature, marks in an examination, sales revenue, etc. The underlying predictor variable and the target variable are continuous in nature. In the case of linear regression, a straight-line relationship is ‘fitted’ between the predictor variables and the target variables, using the statistical concept of the least squares method. The least squares method attempts to minimize the sum of the squared errors between the actual and predicted values of the target variable. In the case of simple linear regression, there is only one predictor variable, whereas in the case of multiple linear regression, multiple predictor variables can be included in the model. Let’s take the example of the yearly budgeting exercise of sales managers. They have to give a sales prediction for the next year based on the sales figures of previous years vis-à-vis the investment being put in. 
Obviously, the data related to the past as well as the data to be predicted are continuous in nature. In a basic approach, a simple linear regression model can be applied with investment as the predictor variable and sales revenue as the target variable. Figure 1.6 shows a typical simple regression model, where the regression line is fitted based on values of the target variable with respect to different values of the predictor variable. A typical linear regression model can be represented in the form y = c1 + c2x, where ‘x’ is the predictor variable and ‘y’ is the target variable. The input data in Figure 1.6 come from a famous multivariate data set named Iris, introduced by the British statistician and biologist Ronald Fisher. The data set consists of 50 samples from each of three species of Iris – Iris setosa, Iris virginica, and Iris versicolor. Four features were measured for each sample – sepal length, sepal width, petal length, and petal width. These features can uniquely discriminate the different species of the flower. The Iris data set is typically used as training data for solving the classification problem of predicting the flower species based on feature values. However, we can also demonstrate regression using this data set, by predicting the value of one feature using another feature as predictor. In Figure 1.6, petal length is the predictor variable which, when fitted in the simple linear regression model, helps in predicting the value of the target variable, sepal length. FIG. 1.6 Regression Typical applications of regression can be seen in Demand forecasting in retail Sales prediction for managers Price prediction in real estate Weather forecasting Skill demand forecasting in the job market 1.5.2 Unsupervised learning Unlike supervised learning, in unsupervised learning there is no labelled training data to learn from and no prediction to be made. In unsupervised learning, the objective is to take a dataset as input and try to find natural groupings or patterns within the data elements or records. 
Therefore, unsupervised learning is often termed a descriptive model and the process of unsupervised learning is referred to as pattern discovery or knowledge discovery. One critical application of unsupervised learning is customer segmentation. Clustering is the main type of unsupervised learning. It intends to group or organize similar objects together. For that reason, objects belonging to the same cluster are quite similar to each other while objects belonging to different clusters are quite dissimilar. Hence, the objective of clustering is to discover the intrinsic grouping of unlabelled data and form clusters, as depicted in Figure 1.7. Different measures of similarity can be applied for clustering. One of the most commonly adopted similarity measures is distance. Two data items are considered part of the same cluster if the distance between them is small. In the same way, if the distance between the data items is large, the items do not generally belong to the same cluster. This is also known as distance-based clustering. Figure 1.8 depicts the process of clustering at a high level. FIG. 1.7 Distance-based clustering Other than clustering of data and getting a summarized view from it, one more variant of unsupervised learning is association analysis. As a part of association analysis, the association between data elements is identified. Let’s try to understand the approach of association analysis in the context of one of the most common examples, i.e. market basket analysis, as shown in Figure 1.9. From past transaction data in a grocery store, it may be observed that most of the customers who have bought item A have also bought item B and item C, or at least one of them. This means that there is a strong association of the event ‘purchase of item A’ with the event ‘purchase of item B’, or ‘purchase of item C’. Identifying these sorts of associations is the goal of association analysis. 
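Distance-based clustering as described above can be sketched with a tiny k-means loop. The six data points, the two starting centroids and k = 2 are all hypothetical choices for illustration; real clustering work would normally rely on a library implementation.

```python
# Hedged sketch of distance-based clustering: a tiny k-means with k = 2,
# standard library only; the data points below are made up so that two
# well-separated groups exist.
from math import dist

points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning each point to its nearest centroid
    and moving each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            for cluster in clusters if cluster
        ]
    return centroids, clusters

centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(sorted(len(c) for c in clusters))  # -> [3, 3]
```

Points end up in the same cluster precisely when they are close to the same centroid, i.e. small mutual distance, which is the similarity measure the text describes.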
This helps in boosting up the sales pipeline, and hence is a critical input for the sales group. Critical applications of association analysis include market basket analysis and recommender systems. FIG. 1.8 Unsupervised learning FIG. 1.9 Market basket analysis 1.5.3 Reinforcement learning We have seen babies learn to walk without any prior knowledge of how to do it. Often we wonder how they really do it. They do it in a relatively simple way. First they notice somebody else walking around, for example parents or anyone living around. They understand that legs have to be used, one at a time, to take a step. While walking, sometimes they fall down hitting an obstacle, whereas other times they are able to walk smoothly, avoiding obstacles. When they are able to walk overcoming an obstacle, their parents are elated and appreciate the baby with loud claps or maybe a chocolate. When they fall down while circumventing an obstacle, obviously their parents do not give claps or chocolates. Slowly a time comes when the babies learn from their mistakes and are able to walk with much ease. In the same way, machines often learn to do tasks autonomously. Let’s try to understand this in the context of the example of the child learning to walk. The action to be achieved is walking, the child is the agent, and the place with hurdles on which the child is trying to walk is the environment. The machine, as the agent, tries to improve its performance of doing the task. When a sub-task is accomplished successfully, a reward is given. When a sub-task is not executed correctly, obviously no reward is given. This continues till the machine is able to complete execution of the whole task. This process of learning is known as reinforcement learning. Figure 1.10 captures the high-level process of reinforcement learning. FIG. 1.10 Reinforcement learning One contemporary example of reinforcement learning is self-driving cars. 
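The agent–environment–reward loop described above can be sketched as a toy example. The two "actions" and their reward probabilities are entirely hypothetical, and this epsilon-greedy loop is only a stand-in for the far richer algorithms used in real reinforcement learning.

```python
# Hedged sketch of the reinforcement learning loop: an agent repeatedly
# picks an action, the environment returns a reward, and the agent
# updates its estimate of each action's value. Rewards are hypothetical.
import random

random.seed(0)
reward_probability = {"step_over_obstacle": 0.8, "walk_into_obstacle": 0.1}

estimates = {action: 0.0 for action in reward_probability}
counts = {action: 0 for action in reward_probability}

for episode in range(1000):
    # Explore occasionally, otherwise exploit the best-known action
    if random.random() < 0.1:
        action = random.choice(list(reward_probability))
    else:
        action = max(estimates, key=estimates.get)
    # Environment: reward of 1 with the action's (hidden) probability
    reward = 1 if random.random() < reward_probability[action] else 0
    counts[action] += 1
    # Incremental average of observed rewards for this action
    estimates[action] += (reward - estimates[action]) / counts[action]

best = max(estimates, key=estimates.get)
print(best)  # the agent comes to prefer stepping over the obstacle
```

Just as with the child, nobody tells the agent which action is correct; it simply accumulates rewards and gradually settles on the behaviour that earns them most often.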
The critical information which it needs to take care of includes speed and speed limits in different road segments, traffic conditions, road conditions, weather conditions, etc. The tasks that have to be taken care of are start/stop, accelerate/decelerate, turn left/right, etc. Further details on reinforcement learning have been kept out of the scope of this book. Points to Ponder: Reinforcement learning is getting more and more attention from both industry and academia. Annual publication counts in the area of reinforcement learning in Google Scholar support this view. While Deep Blue used brute force to defeat the human chess champion, AlphaGo used RL to defeat the best human Go player. RL is an effective tool for personalized online marketing. It considers the demographic details and browsing history of the user in real time to show the most relevant advertisements. 1.5.4 Comparison – supervised, unsupervised, and reinforcement learning 1.6 PROBLEMS NOT TO BE SOLVED USING MACHINE LEARNING Machine learning should not be applied to tasks in which humans are very effective or frequent human intervention is needed. For example, air traffic control is a very complex task needing intense human involvement. At the same time, for very simple tasks which can be implemented using traditional programming paradigms, there is no sense in using machine learning. For example, simple rule-driven or formula-based applications like a price calculator engine, a dispute tracking application, etc. do not need machine learning techniques. Machine learning should be used only when the business process has some lapses. If the task is already optimized, incorporating machine learning will not serve to justify the return on investment. For situations where training data is not sufficient, machine learning cannot be used effectively. This is because, with small training data sets, the impact of bad data is exponentially worse. 
For the quality of the prediction or recommendation to be good, the training data should be sizeable. 1.7 APPLICATIONS OF MACHINE LEARNING Wherever there is a substantial amount of past data, machine learning can be used to generate actionable insight from the data. Though machine learning is adopted in multiple forms in every business domain, we cover below three major domains just to give some idea about the types of actions that can be performed using machine learning. 1.7.1 Banking and finance In the banking industry, fraudulent transactions, especially ones related to credit cards, are extremely prevalent. Since the volume as well as the velocity of transactions is extremely high, high-performance machine learning solutions are implemented by almost all leading banks across the globe. The models work on a real-time basis, i.e. the fraudulent transactions are spotted and prevented right at the time of occurrence. This helps in avoiding a lot of operational hassle in settling the disputes that customers would otherwise raise against those fraudulent transactions. Customers of a bank are often offered lucrative proposals by competitor banks. Proposals like higher interest rates, lower loan processing charges, zero-balance savings accounts, no overdraft penalty, etc. are offered to customers, with the intent that the customer switches over to the competitor bank. Also, sometimes customers get demotivated by the poor quality of service of their bank and shift to a competitor bank. Machine learning helps in preventing, or at least reducing, customer churn. Both descriptive and predictive learning can be applied for reducing customer churn. Using descriptive learning, the specific pockets of the problem, i.e. a specific branch or a specific zone or a specific type of offering like car loans, may be spotted where maximum churn is happening. Quite obviously, these are troubled areas where further investigation needs to be done to find and fix the root cause. 
Using predictive learning, the set of vulnerable customers who may leave the bank very soon can be identified. Proper action can then be taken to make sure that these customers stay back.

1.7.2 Insurance

The insurance industry is extremely data-intensive. For that reason, machine learning is extensively used in the insurance industry. Two major areas in the insurance industry where machine learning is used are risk prediction during new customer onboarding and claims management. During customer onboarding, based on past information, the risk profile of a new customer needs to be predicted. Based on the quantum of risk predicted, the quote is generated for the prospective customer. When a customer claim comes for settlement, past information related to historic claims, along with the adjustor's notes, is considered to predict whether there is any possibility that the claim is fraudulent. Other than the past information related to the specific customer, information related to similar customers, i.e. customers belonging to the same geographical location, age group, ethnic group, etc., is also considered to formulate the model.

1.7.3 Healthcare

Wearable device data forms a rich source for applying machine learning to predict the health condition of a person in real time. In case some health issue is predicted by the learning model, the person is immediately alerted to take preventive action. In case of some extreme problem, doctors or healthcare providers in the vicinity of the person can be alerted. Suppose an elderly person goes for a morning walk in a park close to his house. Suddenly, while walking, his blood pressure shoots up beyond a certain limit, which is tracked by the wearable. The wearable data is sent to a remote server, where a machine learning algorithm is constantly analyzing the streaming data. It also has the history of the elderly person and of persons of a similar age group. The model predicts some fatality unless immediate action is taken.
An alert can be sent to the person to immediately stop walking and take rest. Also, doctors and healthcare providers can be alerted to be on standby. Machine learning, along with computer vision, also plays a crucial role in disease diagnosis from medical imaging.

1.8 STATE-OF-THE-ART LANGUAGES/TOOLS IN MACHINE LEARNING

The algorithms related to different machine learning tasks are known to all and can be implemented using any language/platform, be it the Java platform, C/C++, or .NET. However, there are certain languages and tools which have been developed with a focus on implementing machine learning. A few of them, the most widely used, are covered below.

1.8.1 Python

Python is one of the most popular open-source programming languages, widely adopted by the machine learning community. It was designed by Guido van Rossum and was first released in 1991. The reference implementation of Python, i.e. CPython, is managed by the Python Software Foundation, a non-profit organization. Python has very strong libraries for advanced mathematical functionality (NumPy), algorithms and mathematical tools (SciPy) and numerical plotting (matplotlib). Built on these libraries, there is a machine learning library named scikit-learn, which has various classification, regression, and clustering algorithms embedded in it.

1.8.2 R

R is a language for statistical computing and data analysis. It is an open-source language, extremely popular in the academic community – especially among statisticians and data miners. R is considered a variant of S, a language developed at Bell Laboratories; R itself is a GNU project. Currently, it is supported by the R Foundation for Statistical Computing. R is a very simple programming language with a huge set of libraries available for different stages of machine learning.
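As a quick aside, the Python stack described in Section 1.8.1 can be seen in action in a minimal sketch: NumPy generates a small synthetic data set and scikit-learn fits a classifier to it. The data and the choice of model here are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, linearly separable data: the label is 1 when the
# two features sum to a positive number.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Fit a simple classifier and check accuracy on the training data.
model = LogisticRegression()
model.fit(X, y)
print(model.score(X, y))   # high accuracy, since the data is separable
```

scikit-learn exposes the same fit/predict/score interface across its classification, regression, and clustering algorithms, which is a large part of its appeal.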
Some of the libraries standing out in terms of popularity are plyr/dplyr (for data transformation), caret ('Classification And REgression Training', for classification), rJava (to facilitate integration with Java), tm (for text mining), and ggplot2 (for data visualization). Beyond the libraries, packages like Shiny and R Markdown have been developed around R for building interactive web applications, documents, and dashboards without much effort.

1.8.3 MATLAB

MATLAB (matrix laboratory) is licensed commercial software with robust support for a wide range of numerical computing. MATLAB has a huge user base across industry and academia. MATLAB is developed by MathWorks, a company founded in 1984. Being proprietary software, MATLAB is developed professionally, tested rigorously, and has comprehensive documentation. MATLAB also provides extensive support for statistical functions and has a huge number of machine learning algorithms built in. It also has the ability to scale up for large datasets by parallel processing on clusters and in the cloud.

1.8.4 SAS

SAS (earlier known as 'Statistical Analysis System') is another licensed commercial software suite which provides strong support for machine learning functionality. Developed in C by the SAS Institute, SAS had its first release in 1976. SAS comprises different components. The basic data management functionality is embedded in the Base SAS component, whereas other components like SAS/INSIGHT, Enterprise Miner, SAS/STAT, etc. help in specialized functions related to data mining and statistical analysis.

1.8.5 Other languages/tools

There are a host of other languages and tools that also support machine learning functionality. Owned by IBM, SPSS (originally named Statistical Package for the Social Sciences) is a popular package supporting specialized data mining and statistical analysis.
Originally popular for statistical analysis in the social sciences (as the name reflects), SPSS is now popular in other fields as well. Released in 2012, Julia is an open-source, liberally licensed programming language for numerical analysis and computational science. It combines desirable features of MATLAB, Python, R, and other programming languages used for machine learning, for which it is gaining steady attention from the machine learning development community. Another big point in favour of Julia is its ability to implement high-performance machine learning algorithms.

1.9 ISSUES IN MACHINE LEARNING

Machine learning is a field which is relatively new and still evolving. Also, the level of research and the kind of use of machine learning tools and technologies vary drastically from country to country. The laws and regulations, cultural background, and emotional maturity of people also differ drastically across countries. All these factors mean that both the use of machine learning and the issues originating from its usage are quite different across countries. The biggest fear and issue arising out of machine learning is related to privacy and the breach of it. The primary focus of learning is on analyzing data, both past and current, and coming up with insight from the data. This insight may be related to people, and the facts revealed might be private enough to be kept confidential. Also, different people have different preferences when it comes to sharing of information. While some people may be open to sharing some level of information publicly, other people may not want to share it even with all friends, and may keep it restricted just to family members. Classic examples are a birth date (not the day, but the date as a whole), photographs of a dinner date with family, educational background, etc. Some people share them with all on social platforms like Facebook while others do not, or if they do, they may restrict it to friends only.
When machine learning algorithms are implemented using such information, people may inadvertently get upset. For example, if a learning algorithm does preference-based customer segmentation and the output of the analysis is used for sending targeted marketing campaigns, it may hurt people's sentiments and actually do more harm than good. In certain countries, such events may result in legal action being taken by the people affected. Even if there is no breach of privacy, there may be situations where actions taken based on machine learning create an adverse reaction. Let's take the example of a knowledge discovery exercise done before starting an election campaign. If a specific area reveals an ethnic majority or skewness of a certain demographic factor, and the campaign pitch carries a message keeping that in mind, it might actually upset the voters and cause an adverse result. So a very critical consideration before applying machine learning is that proper human judgement should be exercised before using any outcome from machine learning. Only then will the decisions taken be beneficial and not result in any adverse impact.

1.10 SUMMARY

Machine learning imbibes the philosophy of human learning, i.e. learning from expert guidance and from experience. The basic machine learning process can be divided into three parts.
Data Input: Past data or information is utilized as a basis for future decision-making.
Abstraction: The input data is represented in a summarized way.
Generalization: The abstracted representation is generalized to form a framework for making decisions.
Before starting to solve any problem using machine learning, it should be decided whether the problem is a right problem to be solved using machine learning. Machine learning can be classified into three broad categories:
Supervised learning: Also called predictive learning.
The objective of this learning is to predict the class/value of unknown objects based on prior information about similar objects. Examples: predicting whether a tumour is malignant or benign, price prediction in domains such as real estate, stocks, etc.
Unsupervised learning: Also called descriptive learning, it helps in finding groups or patterns in unknown objects by grouping similar objects together. Examples: customer segmentation, recommender systems, etc.
Reinforcement learning: A machine learns to act on its own to achieve the given goals. Examples: self-driving cars, intelligent robots, etc.
Machine learning has been adopted by various industry domains such as Banking and Financial Services, Insurance, Healthcare, Life Sciences, etc. to solve problems. Some of the most adopted platforms to implement machine learning include Python, R, MATLAB, SAS, SPSS, etc. To avoid ethical issues, critical consideration is required before applying machine learning and using any outcome from machine learning.

SAMPLE QUESTIONS

MULTIPLE-CHOICE QUESTIONS (1 MARK EACH):

1. Machine learning is a ___ field.
1. Inter-disciplinary 2. Single 3. Multi-disciplinary 4. All of the above
2. A computer program is said to learn from __________ E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with E.
1. Training 2. Experience 3. Database 4. Algorithm
3. __________ has been used to train vehicles to steer correctly and autonomously on road.
1. Machine learning 2. Data mining 3. Neural networks 4. Robotics
4. Any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function well over other unobserved examples. This is called _____.
1. Hypothesis 2. Inductive hypothesis 3. Learning 4. Concept learning
5. Factors which affect the performance of a learner system do not include
1. Representation scheme used 2. Training scenario 3. Type of feedback 4. Good data structures
6. Different learning methods do not include
1. Memorization 2. Analogy 3. Deduction 4. Introduction
7. A model of language consists of categories which do not include
1. Language units 2. Role structure of units 3. System constraints 4. Structural units
8. How many broad types of machine learning are there?
1. 1 2. 2 3. 3 4. 4
9. The k-means algorithm is a
1. Supervised learning algorithm 2. Unsupervised learning algorithm 3. Semi-supervised learning algorithm 4. Weakly supervised learning algorithm
10. The Q-learning algorithm is a
1. Supervised learning algorithm 2. Unsupervised learning algorithm 3. Semi-supervised learning algorithm 4. Reinforcement learning algorithm
11. This type of learning is to be used when there is no idea about the class or label of a particular data item:
1. Supervised learning algorithm 2. Unsupervised learning algorithm 3. Semi-supervised learning algorithm 4. Reinforcement learning algorithm
12. The model learns and updates itself through reward/punishment in the case of a
1. Supervised learning algorithm 2. Unsupervised learning algorithm 3. Semi-supervised learning algorithm 4. Reinforcement learning algorithm

SHORT-ANSWER TYPE QUESTIONS (5 MARKS EACH):

1. What is human learning? Give any two examples.
2. What are the types of human learning? Are there equivalent forms of machine learning?
3. What is machine learning? What are the key tasks of machine learning?
4. Explain the concept of penalty and reward in reinforcement learning.
5. What is the concept of learning as search?
6. What are the different objectives of machine learning? How are these related to human learning?
7. Define machine learning and explain its different elements with a real example.
8. Explain the process of abstraction with an example.
9. What is generalization? What role does it play in the process of machine learning?
10. What is classification? Explain the key differences between classification and regression.
11.
What is regression? Give examples of some practical problems solved using regression.
12. Explain the process of clustering in detail.
13. Write short notes on any two of the following:
1. Application of machine learning algorithms 2. Supervised learning 3. Unsupervised learning 4. Reinforcement learning

LONG-ANSWER TYPE QUESTIONS (10 MARKS EACH):

1. What is machine learning? Explain any two business applications of machine learning. What are the possible ethical issues of machine learning applications?
2. Explain how human learning happens:
1. Under direct guidance of experts 2. Under indirect guidance of experts 3. Self-learning
3. Explain the different forms of machine learning with a few examples.
4. Compare the different types of machine learning.
5. What do you mean by a well-posed learning problem? Explain the important features that are required to well-define a learning problem.
6. Can all problems be solved using machine learning? Explain your response in detail.
7. What different tools and technologies are available for solving problems in machine learning? Give details about any two of them.
8. What are the different types of supervised learning? Explain them with a sample application in each area.
9. What are the different types of unsupervised learning? Explain them with a sample application in each area.
10. Explain, in detail, the process of machine learning.
11. Write short notes on any two:
1. MATLAB 2. Application of machine learning in Healthcare 3. Market basket analysis 4. Simple linear regression
12. Write the difference between (any two):
1. Abstraction and generalization 2. Supervised and unsupervised learning 3. Classification and regression

Chapter 2 Preparing to Model

OBJECTIVE OF THE CHAPTER:

This chapter gives a detailed view of how to understand the incoming data and build a basic understanding of its nature and quality. This information, in turn, helps in selecting a model and then deciding how to apply it.
So, the knowledge imparted in this chapter helps a beginner take the first step towards effective modelling and solving a machine learning problem.

2.1 INTRODUCTION

In the last chapter, we were introduced to machine learning. In the beginning, we got a glimpse of the journey of machine learning as an evolving technology. It all started as a proposition from the renowned computer scientist Alan Turing – machines can 'learn' and become artificially intelligent. Gradually, over the next few decades, path-breaking innovations came from Arthur Samuel, Frank Rosenblatt, John Hopfield, Christopher Watkins, Geoffrey Hinton and many other computer scientists. They shaped the concepts of Neural Networks, Recurrent Neural Networks, Reinforcement Learning, Deep Learning, etc., which took machine learning to new heights. In parallel, interesting applications of machine learning kept appearing, with organizations like IBM and Google taking a lead. What started with IBM's Deep Blue beating the world chess champion Garry Kasparov continued with IBM's Watson beating two human champions in a Jeopardy! competition. Google also produced a series of innovations applying machine learning. The Google Brain, Sibyl, Waymo, and AlphaGo programs are all extremely advanced applications of machine learning which have taken the technology a few notches up. Now we can see an all-pervasive presence of machine learning technology in all walks of life. We have also seen the types of human learning and how those, in some ways, can be related to the types of machine learning – supervised, unsupervised, and reinforcement. Supervised learning, as we saw, implies learning from past data, also called training data, which has known values or classes. Machines can 'learn' or get 'trained' from the past data and assign classes or values to unknown data, termed test data. This helps in solving problems related to prediction.
This is much like human learning through expert guidance, as happens for infants through parents or for students through teachers. So, supervised learning in the case of machines can be perceived as guided learning from human inputs. Unsupervised machine learning doesn't have labelled data to learn from. It tries to find patterns in unlabelled data. This is much like human beings trying to group together objects of similar shape. This learning is not guided by labelled inputs but draws on patterns inherent in the data itself. Last but not least is reinforcement learning, in which the machine tries to learn by itself through a penalty/reward mechanism – again pretty much the same way human self-learning happens. Lastly, we saw some of the applications of machine learning in different domains such as banking and finance, insurance, and healthcare. Fraud detection is a critical business case which is implemented in almost all banks across the world and uses machine learning predominantly. Risk prediction for new customers is a similarly critical case in the insurance industry which finds application of machine learning. In the healthcare sector, disease prediction makes wide use of machine learning, especially in the developed countries. While development in machine learning technology has been extensive and its implementation has become widespread, to start as a practitioner we need to gain some basic understanding. We need to understand how to apply the array of tools and technologies available in machine learning to solve a problem. In fact, that is going to be very specific to the kind of problem that we are trying to solve. If it is a prediction problem, the kind of activities involved is going to be completely different vis-à-vis a problem where we are trying to unfold a pattern in data without any past knowledge about the data.
So, what a machine learning project looks like, or what salient activities form the core of a machine learning project, will depend on whether it is in the supervised, unsupervised, or reinforcement learning area. However, irrespective of the variation, some foundational knowledge needs to be built before we start with the core machine learning concepts and key algorithms. In this section, we will have a quick look at a few typical machine learning activities and focus on some of the foundational concepts that all practitioners need to gain as pre-requisites before starting their journey in the area of machine learning.

Points to Ponder: No man is perfect. The same is applicable to machines. To increase the level of accuracy of a machine, human participation should be added to the machine learning process. In short, incorporating human intervention is the recipe for the success of machine learning.

2.2 MACHINE LEARNING ACTIVITIES

The first step in machine learning activity starts with data. In the case of supervised learning, it is the labelled training data set followed by test data which is not labelled. In the case of unsupervised learning, there is no question of labelled data, and the task is to find patterns in the input data. A thorough review and exploration of the data is needed to understand the type of the data, the quality of the data, and the relationships between the different data elements. Based on that, multiple pre-processing activities may need to be done on the input data before we can go ahead with the core machine learning activities. The following are the typical preparation activities done once the input data comes into the machine learning system:
Understand the type of data in the given input data set.
Explore the data to understand its nature and quality.
Explore the relationships amongst the data elements, e.g. inter-feature relationships.
Find potential issues in the data.
Do the necessary remediation, e.g.
impute missing data values, etc., if needed.
Apply pre-processing steps, as necessary.
Once the data is prepared for modelling, the learning tasks start off. As a part of it, do the following activities:
Divide the input data into two parts – the training data and the test data (called holdout). This step is applicable for supervised learning only.
Consider different models or learning algorithms for selection.
Train the model on the training data for a supervised learning problem and apply it to unknown data. Directly apply the chosen unsupervised model on the input data for an unsupervised learning problem.
After the model is selected, trained (for supervised learning), and applied on input data, the performance of the model is evaluated. Based on the options available, specific actions can be taken to improve the performance of the model, if possible.
Figure 2.1 depicts the four-step process of machine learning.

FIG. 2.1 Detailed process of machine learning

Table 2.1 contains a summary of the steps and activities involved:

Table 2.1 Activities in Machine Learning

In this chapter, we will cover the first part, i.e. preparing to model. The remaining parts, i.e. learning, performance evaluation, and performance improvement, will be covered in Chapter 3.

2.3 BASIC TYPES OF DATA IN MACHINE LEARNING

Before starting with the types of data, let's first understand what a data set is and what the elements of a data set are. A data set is a collection of related information or records. The information may be on some entity or some subject area. For example (Fig. 2.2), we may have a data set on students, in which each record consists of information about a specific student. Again, we can have a data set on student performance, which has records providing performance, i.e. marks in the individual subjects. Each row of a data set is called a record. Each data set also has multiple attributes, each of which gives information on a specific characteristic.
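The idea of records and attributes can be made concrete with a small sketch, assuming pandas is available. The attribute names mirror the Student data set of Fig. 2.2, while the student values themselves are invented for illustration.

```python
import pandas as pd

# A toy version of the Student data set: each row is a record,
# each column an attribute (feature / dimension). Values are invented.
students = pd.DataFrame({
    "Roll Number": [101, 102, 103],
    "Name":   ["Asha", "Ravi", "Meena"],
    "Gender": ["F", "M", "F"],
    "Age":    [19, 20, 19],
})

# 3 records, each a point in a four-dimensional data space.
print(students.shape)           # (3, 4)
print(list(students.columns))   # the four attributes
```

Each row of this frame is one point in the four-dimensional data space spanned by the four attributes, which is exactly the picture the text describes.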
For example, in the data set on students, there are four attributes namely Roll Number, Name, Gender, and Age, each of which understandably is a specific characteristic of the student entity. Attributes can also be termed feature, variable, dimension or field. Both the data sets, Student and Student Performance, have four features or dimensions; hence they are said to have a four-dimensional data space. A row or record represents a point in the four-dimensional data space, as each row has specific values for each of the four attributes or features. The value of an attribute, quite understandably, may vary from record to record. For example, if we refer to the first two records in the Student data set, the values of the attributes Name, Gender, and Age are different (Fig. 2.3).

FIG. 2.2 Examples of data set

FIG. 2.3 Data set records and attributes

Now that a context of data sets is given, let's try to understand the different types of data that we generally come across in machine learning problems. Data can broadly be divided into the following two types:
1. Qualitative data
2. Quantitative data
Qualitative data provides information about the quality of an object, or information which cannot be measured. For example, if we consider the quality of performance of students in terms of 'Good', 'Average', and 'Poor', it falls under the category of qualitative data. Also, the name or roll number of students is information that cannot be measured using some scale of measurement. So they would fall under qualitative data. Qualitative data is also called categorical data. Qualitative data can be further subdivided into two types as follows:
1. Nominal data
2. Ordinal data
Nominal data is one which has no numeric value, but a named value. It is used for assigning named values to attributes. Nominal values cannot be quantified. Examples of nominal data are
1. Blood group: A, B, O, AB, etc.
2. Nationality: Indian, American, British, etc.
3.
Gender: Male, Female, Other
Note: A special case of nominal data is when only two labels are possible, e.g. pass/fail as the result of an examination. This sub-type of nominal data is called 'dichotomous'.
It is obvious that mathematical operations such as addition, subtraction, multiplication, etc. cannot be performed on nominal data. For that reason, statistical functions such as mean, variance, etc. cannot be applied to nominal data either. However, a basic count is possible. So the mode, i.e. the most frequently occurring value, can be identified for nominal data.
Ordinal data, in addition to possessing the properties of nominal data, can also be naturally ordered. This means ordinal data also assigns named values to attributes, but unlike nominal data, these values can be arranged in a sequence of increasing or decreasing value, so that we can say whether a value is better than or greater than another value. Examples of ordinal data are
1. Customer satisfaction: 'Very Happy', 'Happy', 'Unhappy', etc.
2. Grades: A, B, C, etc.
3. Hardness of metal: 'Very Hard', 'Hard', 'Soft', etc.
Like nominal data, basic counting is possible for ordinal data. Hence, the mode can be identified. Since ordering is possible in the case of ordinal data, the median and quartiles can be identified in addition. The mean still cannot be calculated.
Quantitative data relates to information about the quantity of an object – hence it can be measured. For example, if we consider the attribute 'marks', it can be measured using a scale of measurement. Quantitative data is also termed numeric data. There are two types of quantitative data:
1. Interval data
2. Ratio data
Interval data is numeric data for which not only the order is known, but the exact difference between values is also known. An ideal example of interval data is Celsius temperature. The difference between each value remains the same in Celsius temperature.
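Before moving on to quantitative data, the central-tendency rules stated above for qualitative data can be sketched in code: only the mode is meaningful for nominal data, while ordered (ordinal) data additionally admits a median. This is a minimal illustration, assuming pandas is available; the sample values are invented.

```python
import pandas as pd

# Nominal data: only counting applies, so the mode is the sole
# central-tendency measure available.
blood_group = pd.Series(["A", "B", "O", "AB", "O", "A", "O"])
print(blood_group.mode()[0])                    # "O" occurs most often

# Ordinal data: named values with a natural order, so a median
# exists as well. Encode the order explicitly, then take the
# median of the underlying ranks.
grades = pd.Categorical(
    ["B", "A", "C", "B", "A", "B"],
    categories=["C", "B", "A"],                 # increasing order of merit
    ordered=True,
)
ranks = pd.Series(grades.codes)                 # C=0, B=1, A=2
print(grades.categories[int(ranks.median())])   # median grade: "B"
```

Note that averaging the grade ranks would be statistically meaningless, which is exactly the point made above: the mean requires at least interval data.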
For example, the difference between 12°C and 18°C is measurable and is 6°C, just as in the case of the difference between 15.5°C and 21.5°C. Other examples include date, time, etc. For interval data, mathematical operations such as addition and subtraction are possible. For that reason, for interval data, the central tendency can be measured by mean, median, or mode. Standard deviation can also be calculated. However, interval data does not have a 'true zero' value. For example, there is nothing called '0 temperature' or 'no temperature'. Hence, only addition and subtraction apply for interval data; ratios cannot be taken. This means we can say a temperature of 40°C is equal to a temperature of 20°C + a temperature of 20°C. However, we cannot say that a temperature of 40°C is twice as hot as a temperature of 20°C.
Ratio data represents numeric data for which an exact value can be measured and an absolute zero is available. These variables can be added, subtracted, multiplied, or divided. The central tendency can be measured by mean, median, or mode, and measures of dispersion, such as standard deviation, can be computed. Examples of ratio data include height, weight, age, salary, etc.
Figure 2.4 gives a summarized view of the different types of data that we may find in a typical machine learning problem.

FIG. 2.4 Types of data

Apart from the approach detailed above, attributes can also be categorized based on the number of values that can be assigned: attributes can be either discrete or continuous on this basis. Discrete attributes can assume a finite or countably infinite number of values. Nominal attributes such as roll number, street number, pin code, etc. can have a finite number of values, whereas numeric attributes such as count, rank of students, etc. can have countably infinite values. A special type of discrete attribute which can assume only two values is called a binary attribute.
Examples of binary attributes include male/female, positive/negative, yes/no, etc. Continuous attributes can assume any possible value which is a real number. Examples of continuous attributes include length, height, weight, price