Machine Learning 600 Study Guide PDF
Summary
This study guide provides a general overview of machine learning concepts, including supervised and unsupervised learning, project cycles, data acquisition, and various algorithms. It covers topics like linear regression, decision trees, clustering, and neural networks. This guide, for a Bachelor of Science in Information Technology, is a helpful resource for students preparing for their machine learning course.
Full Transcript
Bachelor of Science in Information Technology
Machine Learning 600 – Year 2

Table of Contents

Chapter 1: Introduction to Machine Learning
  1.1 History of Machine Learning
  1.2 Type of Machine Learning
    1.2.1 Supervised Learning
    1.2.2 Unsupervised Learning
  1.3 Applicability of Machine Learning Techniques
  1.4 Types of Machine Learning Languages & Data Repositories
  1.5 Data Repositories
  1.6 Summary
  1.7 Review Questions
  1.8 MCQs (Quick Quiz)
Chapter 2: The Machine Learning Project Cycle & Data Acquisition Technique
  2.1 ML Cycle
    2.1.1 Transfer Learning
    2.1.2 One solution fits all
  2.2 Defining the process
  2.2 Building a Data Team
    2.2.1 Data Processing
    2.2.2 Data Storage
    2.2.3 Data Privacy
  2.3 Data Quality and Cleaning
  2.4 ML Experiments
  2.5 Planning
  2.6 Scraping Data
    2.6.1 Copy and Paste
    2.6.2 Using an API
  2.7 Data Migration
  2.8 Summary
  2.9 Review Questions
  2.10 MCQs (Quick Quiz)
  2.11 MCQs (Quick Quiz)
Chapter 3: Statistics, Randomness and Linear Regression in ML
  3.1 Working with ML datasets
    3.1.1 Using Java to load ML datasets
    3.1.2 Basic statistics
  3.2 Linear Regression
    3.2.1 Scatter Plots
    3.2.2 Trendline
    3.2.3 Prediction
  3.3 Programming
  3.4 Randomness
  3.5 Summary
  3.6 Review Questions
  3.7 MCQs (Quick Quiz)
Chapter 4: Decision Trees & Clustering
  4.1 Introduction to Basics Decision Trees
    4.1.1 Why using Decision Trees?
    4.1.2 Disadvantages of using Decision Trees
    4.1.3 Decision Trees Types
    4.1.4 The intuition behind Decision Trees
  4.2 Entropy Computation
    4.2.1 Information Gain Computation
  4.3 Clustering
    4.3.1 Why using Clustering?
    4.3.2 Clustering Models
    4.3.3 K-Means Algorithms
  4.4 Cross Validation
    4.4.1 Silhouette Method
    4.4.2 Data Visualization
  4.5 Summary
  4.6 Review Questions
  4.7 MCQs (Quick Quiz)
Chapter 5: Association Rules Learning - Support Vectors Machine & Neural Networks
  5.1 Association Rules
    5.1.1 Web Mining
  5.2 Support Vectors Machine
  5.3 Why using SVM?
    5.3.1 SVM classification principle
  5.4 Neural Networks
    5.4.1 Data Preparation for ANN
  5.5 Summary
  5.6 Review Questions
  5.7 MCQs (Quick Quiz)
Chapter 6: Machine Learning with Text Documents, Sentiment Analysis & Image Processing
  6.1 Preparing Text for Analysis
  6.2 Stopwords
  6.3 Stemming
  6.4 N-grams
  6.5 TF/IDF
  6.6 Image Processing in ML
    6.6.1 Color Depth
    6.6.2 Images in ML
  6.7 Convolutional Neural Network (CNN)
    6.7.1 Feature Extraction
    6.7.2 Classification
  6.8 CNN & Transfer Learning
  6.9 Assessing the performance of ML algorithms
    6.9.1 Classification and Confusion Matrix
    6.9.2 Regularization
  6.10 Summary
  6.11 Review Questions
  6.12 MCQs (Quick Quiz)

PRESCRIBED OR RECOMMENDED BOOKS
Machine Learning: Hands-On for Developers and Technical Professionals, Second Edition. Jason Bell, ©2020. ISBN: 978-1-119-64219-0.
Introducing Machine Learning. Dino Esposito & Francesco Esposito, ©2020. ISBN: 9780135588383.

This guide covers the following chapters:
Chapter 1: Introduction to Machine Learning
Chapter 2: The Machine Learning Project Cycle & Data Acquisition Techniques
Chapter 3: Statistics, Randomness & Linear Regression in ML
Chapter 4: Decision Trees and Clustering
Chapter 5: Association Rules Learning - Support Vectors Machine & Neural Networks
Chapter 6: Machine Learning with Text Documents, Sentiment Analysis & Image Processing

Chapter 1: Introduction to Machine Learning

LEARNING OUTCOMES
After reading this section of the guide, the learner should be able to:
Understand the concept of Machine Learning (ML)
Understand the need for ML (Applicability)
Differentiate between ML objects and algorithm types
Understand the genesis of ML

1.1 History of Machine Learning
Machine learning (ML) is a subfield of AI (artificial intelligence) that utilises computing capabilities to learn from structured or unstructured data in order to predict or classify various features. The aim of an ML model is to learn, and to improve or evolve over time and with experience. An example of an ML model is a system that uses previous learning experiences to predict the outcomes of current ones: recommender systems, image recognition, medical diagnosis prediction, speech recognition, language translation and more. Over the years, several scientists have steered us in the right direction of ML:
Alan Turing. In 1950, Turing asked whether machines can really think; this was the beginning of AI, which would later give birth to ML (see Figure 1.1).
Arthur Samuel. In 1959, nine years after Turing, Samuel defined ML as a scientific field that provides machines or computers with the capacity to learn without being explicitly programmed. He was the first scientist to implement such a self-learning computer program.
Tom M. Mitchell.
Published a ML book in 1997 where he defined ML as follows: Page |3 A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with the experience E. Figure 1.1: The Turing Test In a Turing test, the judge would type into a terminal program to “talk” to the other two participants. Both the human and the computer would respond, and the judge would decide which response came from the computer. If the judge couldn't consistently tell the difference between the human and computer responses, then the computer won the game. The test continues today in the form of the Loebner Prize, an annual competition in AI. The aim is simple enough: convince the judges that they are chatting to a human instead of a computer chat bot program. Machine Learning objects. We need three objects in the definition of ML. E for experience: this represent the learning process of ML algorithms. P for performance: this is important to assess ML models and algorithms. T for task: perhaps one or more. In short, with a computer executing various T, the E should provide an increase of P. Page |4 1.2 Type of Machine Learning There exists a myriad of ML algorithms but the selection choice will rely on the type of outputs required for the system. ML algorithms are stratified into two categories: unsupervised and supervised learning algorithms. 1.2.1 Supervised Learning Supervised learning works with a set of labelled features (input-output) named training dataset. Every observation in the training dataset must have an input and output object. The supervised ML algorithm will use this training set to predict or classify unknown features during the testing phase and observations that was not considered or included in the training dataset will be classified as unknown instances. An example is given in Figure 1.2. Figure 1.2: Supervised learning There are problems with this ML technique: Bias-variance dilemma: how the ML model performs accurately using different training sets? Page |5 High-bias models: how to deal with complex and noisy training dataset? Size of the training dataset: what is the acceptable size of the training dataset for the ML algorithm to compute accurate classification and prediction? 1.2.2 Unsupervised Learning The opposite of supervised learning is unsupervised learning. With unsupervised learning, the ML algorithm detect hidden patterns in a given dataset. With unsupervised learning there is no wrong or right answer; it's just a case of running the ML algorithm and seeing what patterns and outcomes occur. Unsupervised learning can be thought of as a learner without a teacher because there is no need for the training dataset. The unsupervised learning algorithm will stratify features based on similarity and dissimilarity of hidden patterns (see Figure 1.3). Figure 1.3: Unsupervised learning Page |6 1.3 Applicability of Machine Learning Techniques ML is widely utilised in software engineering to build an application or software enabling improved experience with the user. With some packages or libraries, the software or application will be able to learn about the user's behaviour. After the software or application has been in use for a period of time, it begins to predict what the user wants to do. For instance, Facebook has been using ML to automatically suggest friends and Netflix to recommend movies (see Table 1.1). 
Table 1.1: Machine learning in real life Software Description Spam detection To figure out whether an email is a spam or not Voice recognition Apple's Siri service that is on many iOS devices Stock trading Help users make better stock trades – make recommendations Robotics Self-driving cars & Sophia (speaking robot) Medicine & Diagnosis via X-ray and MRI scans to detect various cancers and other Healthcare disease Advertising Figuring out where a customer will be on a given day and attempt to push offers his/ her way E-commerce Buy and sell products online. E.g., Amazon Gaming Chess computer games with an artificial player that can beat a humankind 1.4 Types of Machine Learning Languages & Data Repositories There are different languages that can be used to implement a system based on ML. Java is a widely used language, especially in the enterprise, and the libraries are well supported. Clojure is another language that gives better data handling abilities: data goes into a function, and the result is output as data. Java isn't the only language to be used for ML. If you're working for an existing organization, you may be restricted to the languages used within it. These languages are depicted in Table 1.2. Page |7 Table 1.2: Machine Learning Languages Programing language Description Python Easy to learn & read, has ML libraries (scikit-learn & Tensorflow) R Open source & statistical programing language that has visualisation tools and ML packages Matlab Widely used in academia – it can be used for data visualisation and plotting graphs Scala-Scalable language A new breed of language that takes advantage of Java & increase performance based on threading. It includes Natural Language Processing (NLP) libraries such as ScalaNLP Ruby Web development framework & integrate ML libraries such as JRuby There are existing software that incorporate ML algorithms. With this, programmers or data scientists do not have to code the ML algorithm from scratch but they can use the software, load the dataset and invoked the ML library on the loaded dataset. Table 1.3 presents some of the well-known ML software. Page |8 Table 1.3: Machine Learning Software Software Description WEKA (Open source) Provide a set of ML algorithms & visualisation tools, can pre-process data, classify and predict - www.cs.waikato.ac.nz/ml/weka/downloading.html KafKa Use for streaming data Spark & Hadoop Use for processing Big Data - https://spark.apache.org. DeepLearning4J It can scale with Big Data - https://deeplearning4j.org. 1.5 Data Repositories Where can one get data? There are different answers to this question. Nevertheless, the correct answer will be based on the learning experience of the ML algorithm. Generally, data comes in all sizes and formats. More often the dataset is in a comma-separated variable (CSV) format, or XML and JSON format. One must have a question in mind that he/ she is attempting to answer using a specific dataset (this is a good start). For example, one would like to predict a student performance based on the module he/ she registered for. The dataset of previous students with their modules and marks should be provided and the ML algorithms will be trained on this dataset to predict current student’s performances based on previous experiences. Note that the ML algorithm learning-experience will come from improvement of features and experimentation on results. 
So, one must first play around with the dataset to understand its characteristics and seeing what ML techniques will be the most suitable with the problem at hand. Sometimes supervised learning can perform better than unsupervised learning or vice versa. One can also use a semi-supervised approach that combines both, supervised and unsupervised learning. One can also use Deep Learning which is similar to unsupervised learning. The following Table 1.4 portrays some directories where one can download plenty of ML datasets. Page |9 Table 1.4: Machine Learning Datasets Repositories Repository Description UC Irvine ML repository More than 270 datasets are available - http://archive.ics.uci.edu/ml/datasets. Kaggle ML datasets can be found at www.kaggle.com/competitions. 1.6 Summary This Chapter defined ML and its applicability in real life. The chapter covered various tools, datasets and languages used in the ML community. The next chapter introduces classification in ML and how one can plan for a ML project. It also covers different data science concepts such as data cleaning, and various techniques of processing data. P a g e | 10 1.7 Review Questions 1. What is the aim of a Turing test? 2. Briefly describe the difference between classification and prediction. 3. What is the difference between a feature and a dataset? 4. Describe the function of a ML dataset. 5. Briefly describe Natural Language Processing and AI. 6. Give an example of recommendation systems. 7. Beside CSV, JSON and XML formats; could you list three additional datasets formats used in the ML community? Read Recommended Reading: Bell, J., 2020. Machine learning: hands-on for developers and technical professionals. John Wiley & Sons. Esposito, D. and Esposito, F., 2020. Introducing Machine Learning. Microsoft Press. P a g e | 11 1.8 MCQs (Quick Quiz) 7. Which of the following formula best depict ML? 𝑇+𝐸 a) 𝑀𝐿 = 𝑃 𝑃+𝑇 b) 𝑀𝐿 = 𝐸 𝑃+𝐸 c) 𝑀𝐿 = 𝑇 𝑃∗𝑇∗𝐸 d) 𝑀𝐿 = 𝑃 8. What does LP in NLP stand for? a) Language Programing b) Language Program c) List Processing d) None of the above 9. Where can one get ML datasets? a) Internet b) Databases c) Kaggle d) All of the above 10. Which one represent an issue with supervised learning? a) Dataset size b) High-bias models c) a & b d) None of the above 11. Which one is not a ML objects? a) Task(s) b) Experience c) Performance d) Accuracy P a g e | 12 Chapter 2: The Machine Learning Project Cycle & Data Acquisition Technique LEARNING OUTCOMES After reading this Section of the guide, the learner should be able to: Understand the ML Project Cycle (ML process) Understand Data Processing & Privacy Understand Data Cleaning & Quality Understand Data Acquisition Process Understand Data Migration process 2.1 ML Cycle A ML project is a cycle of various actions to be performed: data can be acquired from different data sources. (It might be data that is held by your organization or publicly available data from the Internet). There might be one dataset, or there could be 10 or more. You must come to accept that data will need to be cleaned and checked for quality before any processing can take place. These processes occur during the prepare phase. The processing phase is where the work gets done. Finally, the results are presented. Reporting can happen in a variety of ways, such as reinvesting the data into a data store or reporting the results as a spreadsheet or report (see Figure 2.1 that delineates the concept). 
Figure 2.1: The ML process P a g e | 13 ML projects start with a question that needs to be investigated. Using a whiteboard, sticky notes, or even a sheet of paper, start asking questions like the following: Is there a correlation between student’s marks and the gender? Do sales on the weekend generate more revenue compared to the other days of the week? Can we plan what items to stock in the next five months by looking at social media (Twitter & Facebook)? Can we tell when students perform badly? All these examples are reasonable questions, and they also provide the basis for proper discussion. Stakeholders will usually come up with the questions, and then the data project team (which might be one person—you!) can spin into action. Without knowing the question, it's difficult to know where to start. 2.1.1 Transfer Learning With the plethora of ML now being implemented out in the field, it may be worth looking into existing models and modifying certain parameters settings to fit in with your prediction data, especially if you don't have much in the way of training data. This is called transfer learning. It's perfect for models that require large scale datasets for training, such as images, video, and large text corpus. 2.1.2 One solution fits all ML is built up from a varying set of tools, languages, and techniques. It's fair to say that there is no one solution that fits most projects. For example, there might be data in a relational database that needs extracting to a file before you can process it. Managers and developers can have joy and happiness when a data project is assigned. It's new, it's hip, and, it's funky to be working on data projects. Then after the scale of the project comes into focus, the colour drain from their faces. Usually this happens after the managers and developers see how many different elements are required to get things working for the project to succeed. And, like any major project, the specification from the stakeholders will change things along the way. P a g e | 14 2.2 Defining the process Making anything comes down to process, whether that's baking a cake, brewing a cup of coffee, or planning a ML project. Processes can be refined as time goes on, but if you've never developed one before, then you can use the following process as a template: Planning: projects start on paper, it doesn't work to jump in and code; that method is haphazard and error prone. You need to plan first. You can use A5 Moleskin notebooks for notes and use A4 and A3 artist drawing pads for large diagrams. Whiteboards are good, too. Planning consist of determining where the data will be extracted, how it will be cleaned, what ML algorithm to use and what is the expected output.1 Developing: code or algorithms development. It's worth using some form of code repository site like GitHub or Bitbucket. Testing: In this case, testing means testing with data. You might use a random sample of the data or the full set. The important thing is to remind yourself that you're testing the process, so it's okay for things to not go as planned. If you push things straight to production, then you won't really know what's going to happen. With testing you can get an idea of the pain points. You might find data-loading issues, data-processing issues, or answers that just don't make sense. When you test, you have time to change things. Reporting: sit down with the stakeholders and discuss the test results. Refining: refine code and, if possible, the algorithms. 
Production: be sure to give consideration to when this project will be run—is it an hourly/daily/weekly/monthly job? Avoiding Bias: It is important to get the teams talking to each other about how to avoid introducing any form of bias into the final solution. Dataset choice and ML techniques (supervised/ unsupervised are important). P a g e | 15 2.2 Building a Data Team A data scientist is someone who can bring the facets of data processing, analytics, statistics, programming, and visualization to a project. With so many skill sets in action, even for the smallest of projects, it's a lot to ask for one person to have all the necessary skills. A data science team is made of mathematicians, statisticians, graphic designers, programmers and domain experts depending of the project. 2.2.1 Data Processing After you have a team in place and a rough idea of how all of this is going to get put together, it's time to turn your attention to what is going to do all the work for you. You must give thought to the frequency of the data process jobs that will take place. If it will occur only once in a while, then it might be false economy investing in hardware over the long term. It makes more sense to start with what you have in hand and then add as you go along and as you notice growth in processing times and frequency. You can process data with your own computer, a cluster machines or use cloud-based services. 2.2.2 Data Storage This might be on a physical disc or deployed on a cloud-based solution. 2.2.3 Data Privacy In Europe, the General Data Protection Regulations control how business can use personal data within their organizations. Ultimately, with great power comes great responsibility, and it will be up to you how that data is protected and processed. 2.3 Data Quality and Cleaning In the real world, data is messy, usually unclean, and error-prone. The following sections offer some basic checks to do, and some sample data have been included so you can see clearly P a g e | 16 what to look for. The example data is a simple address book with a first name, last name, e- mail address, and age (see Table 2.1). Table 2.1: Sample data Presence Checks: First things first, check that data has been entered at all. The presence check is simple enough. If the field length is empty or null and that piece of data is important in the analysis, then you can't use records from which the data is missing. The first name and e-mail are missing from the example in Table 2.1, so the record should really be fixed or rejected. In theory, the data could be used if knowing the customer was not important. Type Checks: With relational databases you have schemas created, so there's already an expectation of what type of data is going where. If incorrect data is written to a field of a different data type, then the database engine will throw an error and complain at you. In text data, such as CSV files, that's not the case, so it's worth looking at each field and ensuring that what you're expecting to see is valid. From the example, you can see that the first row of data is correct, but the second is wrong because the first name field has a number in it and not a string type. There are a couple of things you could do here. The first option is to ignore the record, as it doesn't fit the data- quality check. The other option is to see if any other records have the same e-mail address and check the name against those records. Length Checks. 
Field lengths must be checked, too; once again, relational databases exercise a certain amount of control, but textual data can be error-prone if people don't follow the general rules of the schema (see Table 2.2).
Table 2.2: Length checks
Range Checks: Range or reasonableness checks are used with numeric or date ranges. Age ranges are the main talking point here. Until there are advances in scientific medicine to prolong life, you can make a fairly good assumption that the upper lifespan of someone is about 120. You can even play it safe and extend the upper range to 150; anyone older than that is lying or just trying to put a false value in to trip up the system (see Table 2.3).
Table 2.3: Range Checks
Format Checks: When you know that certain data must follow a given format, then it's always good to check it. Regular expression knowledge is a big advantage here. E-mail addresses can be used and abused in web forms and database tables, so it's always a good idea to validate what you can at the source.
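As an illustration only (this is not code from the prescribed book), presence, range and format checks like the ones above might look like this in Java; the field names and the e-mail pattern are assumptions made for the example:

    import java.util.regex.Pattern;

    public class RecordChecks {
        // A simple e-mail pattern for illustration; production validation is usually stricter.
        private static final Pattern EMAIL = Pattern.compile("^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$");

        // Presence check: the field must not be null or empty.
        static boolean isPresent(String field) {
            return field != null && !field.trim().isEmpty();
        }

        // Range (reasonableness) check: an age should fall between 0 and 120.
        static boolean isReasonableAge(int age) {
            return age >= 0 && age <= 120;
        }

        // Format check: the value must at least look like an e-mail address.
        static boolean isValidEmail(String email) {
            return isPresent(email) && EMAIL.matcher(email).matches();
        }

        public static void main(String[] args) {
            System.out.println(isPresent(""));                 // false - presence check fails
            System.out.println(isReasonableAge(150));          // false - outside the expected range
            System.out.println(isValidEmail("jane@site.com")); // true
        }
    }

Records that fail any of these checks can then be fixed or rejected, as described earlier.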
2.4 ML Experiments
It is safe to say that there is no "one solution fits all." There are many components, formats, tools, and considerations to ponder on any project. In effect, every ML project starts with a clean sheet and communication among all involved, from stakeholders all the way through to visualization. Tools and scripts can be reused, but every case is going to be different, so things need minor adjustments as you go along. Don't be afraid to play around with data as you acquire it; see whether there's anything you can glean from it. It's also worth taking time to grab some open data, make your own scenarios and ask your own questions. It's like a musician practicing an instrument; it's worth putting in the hours so you are ready for the day when the big gig arrives. The ML community is large, and there are plenty of blog posts, articles, videos, and books produced by the community. Forums are the perfect place to swap stories and experiences, too. As with most things, the more you put in, the more you will get out of it. If you haven't looked at the likes of http://stackoverflow.com, a collaborative question-and-answer platform for software developers, then have a search around. Chances are that someone has encountered the same problem as you.

2.5 Planning
As with any project, planning is a key and essential part of ML and should not be taken lightly. This chapter covers many aspects of planning, including processing, storage, privacy, and data cleaning.

2.6 Scraping Data
The question to ask is where the data is coming from and whether it needs transformation or cleaning. When it comes to ML projects, you'll spend a large portion of your time getting the data into the right shape so it can be processed. It is the dark art of ML that consists of Extracting, Transforming, and Loading (ETL) data. Processing scraped data requires a few steps to get it from its usual messy state to something usable:
Figure out the source of the data.
Figure out the data extraction process.
Convert it into a format the machine can read.
Make sure the variable values are workable.
Figure out where to store it.

2.6.1 Copy and Paste
Data can be extracted from a web page or a series of web pages, and they tend to be a mess, but some are better than others. A first attempt would be to copy and paste the data from the page and then figure out a way to remove the HTML tags. There are, however, easier ways. Let's look at an example. Suppose we have been tasked with extracting airport data. We would like to see the busiest airports in the United Kingdom. We have found a page on Wikipedia, and we would like to get the data (see Figure 2.2).
Figure 2.2: Data from Wikipedia
Source: https://en.wikipedia.org/wiki/List_of_busiest_airports_in_the_United_Kingdom
There are several tables that have the information we are looking for. If we copy and paste the 2017–2018 data into a text file, the output is okay but needs cleaning (see Figure 2.3). The data can also be stored in an Excel spreadsheet (see Figure 2.4).
Figure 2.3: Copy and paste data into a text file
Figure 2.4: Copy and paste data into an Excel spreadsheet

2.6.2 Using an API
An API is a set of routines supplied by a system or a website that lets you request and receive data or talk to a system directly. Most of the time you will need some sort of authority to talk to the service; this might be a token key or a username and password combination, for example. Some APIs are public and don't require any sign-up, but those are rarer now because suppliers like to know who's calling, what data you're taking, and how often you're taking it. The website OpenWeather (https://openweathermap.org) has a full suite of APIs to retrieve weather information.

2.7 Data Migration
Acquiring data is one part of the equation; migrating and transforming it will also be requested at some point. For some jobs, writing a small program or script to import/export data would be fine, but as the volumes grow and the demands from stakeholders get more complex, we need to start looking at alternative tools. Embulk is an open source bulk loading tool. It provides a number of plugins to read, write, and transform data. For example, if you wanted to read a directory of CSV files, transform them to JSON, and write them to AWS S3, that could be done with Embulk with a single configuration file. If you are using OpenJDK, Embulk runs on version 8 without any issues.

2.8 Summary
This chapter has discussed the planning of any ML project and outlined a few techniques for acquiring data, whether that be via page scraping, using Google Sheets to import table data, or using scripting languages to clean up files. If an API is available, then it makes sense to maximize the potential gains from it whenever you can. When the volumes of data start to build, then it's worth using tools designed for the job instead of crafting your own. The open source Embulk application is an excellent example of what has been created in the open source world.

2.9 Review Questions
1. What is the aim of a ML Project Cycle?
2. Briefly describe the steps involved in a ML project.
3. What is the difference between Data Quality and Data Cleaning?
4. Describe the function of a presence and range check.
5. Why is Data Privacy important in ML? (Give an example of Data Privacy.)
6. What is the aim of Embulk?
7. Briefly describe the steps involved in a Data Acquisition process.
8. What is the difference between Data Acquisition and Data Migration?
9. Describe the function of an API.
10. What is ETL in ML? (Give a practical example of ETL in real life.)
Read Recommended Reading:
Bell, J., 2020. Machine learning: hands-on for developers and technical professionals. John Wiley & Sons.
Esposito, D. and Esposito, F., 2020. Introducing Machine Learning. Microsoft Press.

2.10 MCQs (Quick Quiz)
6. Which of the following is an example of data migration? a) Embulk b) OpenJDK c) ETL d) a & b e) None
7. What is an API?
a) A software to append data b) An application to extract data c) A ML tool to visualise features d) None of the above 8. Which one is not a good example of scraping data in ML? a) Process data b) Extract data c) Transform data d) Load data 9. Which question is likely to be asked while scraping data in ML? a) Where and how to plot graphs? b) Where and how to extract data? c) Which ML algorithm to use? d) None of the above 10. How many phases do we have in a Data Acquisition process? a) Five b) Nine c) Three d) None of the above P a g e | 24 2.11 MCQs (Quick Quiz) 11. Which of the following is an example of data cleaning? a. Data processing b. Range checks c. Length checks d. a & b e. b & c 12. What is transfer learning? a) Adaptive computation settings b) Extracting and transferring features c) Learning with transformation d) None of the above 13. Which one is not a good example of a ML project question? a) Can we classify patient’s results and predict their medication? b) How to detect fraudulent banking transactions? c) Can we get input from users? d) All of the options 14. Which one is likely to be used in a ML experiment? a. Dataset & ML algorithms b. ML tools & whiteboard c. Data security d. Data modelling 15. How many phases do we have in a ML project cycle? a. Five b) Nine c) Three d) Four P a g e | 25 Chapter 3: Statistics, Randomness and Linear Regression in ML LEARNING OUTCOMES After reading this Section of the guide, the learner should be able to: Understand how to work with ML datasets Understand Datasets conversion in ML Understand the basic of statistics Understand Linear Regression in ML Understand the concept of randomness & prediction in ML After acquiring and cleaning our data, it's now time to focus our attention on some numbers. As an introduction, it is a good idea to revisit some statistics and how they can be used. In addition, this chapter will cover standard deviation, Bayesian techniques, forms of linear regression, and the power of random numbers. 3.1 Working with ML datasets You can download ML datasets from the GitHub repository and your first task is to convert the contents of each line of the text file and convert them to an integer type that your program can understand. 3.1.1 Using Java to load ML datasets The process is identical to the Clojure process, though in the Java language it's a little more involved in terms of code. P a g e | 26 Using the BufferedReader and FileReader objects, a stream is created to read in the file. After iterating each line, it converts the value to an integer and adds it to the list. Notice the use of the Double object to call the parseDouble method. It's the same method as used by the Clojure program. Assuming the resulting functions have been stored in a new object, then it's ready for use to get some summary statistics. 3.1.2 Basic statistics This section covered the basic of statistics: the sum, minimum and maximum, mean, mode, median, range, variance, and standard deviation: Minimum & Maximum values: Finding the minimum and maximum values of a list of numbers, while not seemingly ground breaking in terms of stats or ML, is still worthwhile to know. The Collections object will give you access to the methods min and max assuming that the input type is a collection. Sum: The sum, or rather summation, is the addition of a sequence of numbers. The result is a single value of the total. The order of the numbers is not important in summation. For example, the summation of [1, 2, 3, 4] is the same as [3, 1, 4, 2]. 
With Java, things require a little more thought, as we are dealing with a collection of objects. At this point, we could write a method to get the sum for us, iterating each value in the collection and adding to the accumulative total. An alternative would be to use the Arrays class and its stream() method. Be aware that this method takes only primitive arrays as its input, so you need to convert the List first.
Mean: The mean, or the average, is one of the first statistical methods you'll learn at school. When we say "the mean" or "the average," we are normally referring to the arithmetic mean. The mean gives us a good idea of where the middle is in a set of data. However, there is a caveat: a nice smooth average works on the assumption that the dataset is evenly distributed. If there are outliers within the dataset, then the average can be heavily distorted and incorrect. When there are outliers in the data, it's wiser to use the median as a gauge.
Arithmetic Mean: To calculate the arithmetic mean, take the set of numbers and sum them. The last step is to divide that sum by the number of items in the dataset.
Harmonic Mean: The harmonic mean is calculated differently. There are three steps to complete the calculation: take the reciprocal of each value, calculate the arithmetic mean of those reciprocals, and then take the reciprocal of that mean.
Geometric Mean: If the values in your dataset are widely different, then it's worth using the geometric mean to find the average. The calculation is made by multiplying the set of numbers and finding the nth root of the total. For example, if your set had two numbers in it, you'd take the square root of the total; if it had three numbers, you would take the cube root; and so on.
Mode: To find the most commonly occurring number in the dataset, we use the mode. Use the StatUtils.mode method in Apache Commons Math to get the mode of a double primitive array. Notice that it returns a double primitive array, because more than one value can share the highest frequency.
Median: To find the middle number of the dataset, you use the median. Finding the median involves listing the dataset in ascending order and finding the middle number. If the total number of values in the dataset is odd, then the middle number is a value from the dataset. On the other hand, if the dataset has an even number of values, then the average of the middle two numbers is used. Using the DescriptiveStatistics class, the getPercentile method will give the median from a collection. You will have to iterate the collection and add each double value to the instance of the class with the addValue method.
Range: The range of the dataset is calculated by subtracting the minimum value of the set from the maximum value. For example, the range of the set [2, 3, 3, 4, 7, 12] is 12 − 2 = 10.
Interquartile Range: As already discussed, if a dataset has outliers, the arithmetic mean will not be the centred average you are looking for; it's best to use either the harmonic or geometric mean. The range gives the complete spread of the data, start to end. The interquartile range gives you the bulk of the values, also known as "the middle 50." Subtracting the first quartile of the dataset from the third quartile will give you the interquartile range.
Variance: The variance will give you the spread of the dataset. If you have a variance of zero, then all the values of the dataset are the same. There is a process to working out the variance of a dataset: work out the mean of the dataset; for each number in the dataset, subtract the mean and then square the result; then calculate the average of those squared differences.
Standard Deviation: The standard deviation (SD) is a number that tells us how the values of a dataset are spread out from the mean. If the standard deviation is low, most of the numbers in the dataset are close to the average. A large standard deviation shows that the numbers in the set are more spread out from the average. Most of the work for the standard deviation is done by calculating the variance; the missing step is to take the square root of the variance. The values that lie in the distribution can be calculated once you have the SD. Called the empirical rule (or the 68-95-99.7 rule), it tells you that about 68 percent of the values lie within one standard deviation of the mean, 95 percent within two, and 99.7 percent within three.
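To make these definitions concrete, here is a small, self-contained sketch using the Apache Commons Math classes mentioned above (DescriptiveStatistics and StatUtils); the sample values are made up for illustration and are the same small set used in the Range example:

    import java.util.Arrays;
    import org.apache.commons.math3.stat.StatUtils;
    import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics;

    public class BasicStats {
        public static void main(String[] args) {
            double[] values = {2, 3, 3, 4, 7, 12};   // illustrative dataset

            DescriptiveStatistics stats = new DescriptiveStatistics();
            for (double v : values) {
                stats.addValue(v);                   // load each value into the class
            }

            System.out.println("Sum:      " + stats.getSum());
            System.out.println("Min/Max:  " + stats.getMin() + " / " + stats.getMax());
            System.out.println("Mean:     " + stats.getMean());
            System.out.println("Median:   " + stats.getPercentile(50));  // 50th percentile = median
            System.out.println("Variance: " + stats.getVariance());
            System.out.println("Std dev:  " + stats.getStandardDeviation());

            // StatUtils.mode takes a primitive double array and returns an array,
            // because more than one value can share the highest frequency.
            double[] modes = StatUtils.mode(values);
            System.out.println("Mode(s):  " + Arrays.toString(modes));
        }
    }

Hand-rolling the same calculations is good practice, but for anything beyond a few hundred values a tested library saves both time and arithmetic mistakes.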
3.2 Linear Regression
While linear regression is not a ML algorithm, it is classed as a statistical method. Regardless, being able to predict a value from historical data is a worthwhile skill to have at your disposal. Simple linear regression plots an independent variable (the predictor) against a dependent variable (the criterion variable). A good example uses the two commonly used temperature scales, Fahrenheit and Celsius, because there's a linear relationship between the two. It's illustrated with the following regression equation:
Fahrenheit = 1.8 × Celsius + 32
Say we have a temperature reading of 28 Celsius. To find the Fahrenheit reading, we multiply 28 by 1.8 and add 32. The answer is 82.4°F. You can generate your own linear regression calculations easily either by using a spreadsheet or by using a library. In this example, we're going to use a comma-separated values (CSV) file and generate a simple linear regression using a spreadsheet application. The data comprises two sets of scores from a competition (see Figure 3.1). With the scores of the first judge, is it possible to reliably predict the scores of the second judge? We can find out by using simple linear regression.
Figure 3.1: Dataset showing the judges' scores

3.2.1 Scatter Plots
The next step is to create a simple scatter plot graph. Select all the numbers in both columns and click Insert at the top. The top section of Excel will display a new set of icons; look for the Graph section, and you will see a scatter plot diagram. Clicking this will open a dialog box with scatter plot options. Choose the Scatter option, which is the basic plot. See Figure 3.2.
Figure 3.2: Scatter plot of the two judges' scores

3.2.2 Trendline
First, we would like to see a trendline to show where the data lies relative to the slope. Click the displayed scatter plot, and the options in the top menu will change. Click Add Chart Element, and a drop-down menu will appear. Select Trendline; then move your mouse across to the new menu and select Linear. See Figure 3.3.
Figure 3.3: Trendline added to the scatter plot

3.2.3 Prediction
At this point you can use a calculator to make a prediction. Looking at the graph, we can see this equation:
y = 0.6735x + 3.0788
Assuming we want to predict what the second judge's score will be if the first judge awards a 6 in the competition, we can find out with the following calculation: y = 0.6735 × 6 + 3.0788 = 7.12. Rounding down, we get a score of 7.

3.3 Programming
There comes a time when you will want to progress past a spreadsheet. This might be because there's so much data to process, for example.
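The next paragraph describes how the guide does this with Apache Commons Math's SimpleRegression class. As a minimal sketch only (the score pairs here are invented; the real example loads them from the CSV file, as explained below), the whole process fits in a few lines:

    import org.apache.commons.math3.stat.regression.SimpleRegression;

    public class JudgeScores {
        public static void main(String[] args) {
            // Each pair is (first judge's score, second judge's score) - made-up sample data.
            double[][] scores = {
                {3, 5}, {4, 6}, {5, 6}, {6, 7}, {7, 8}, {8, 9}
            };

            SimpleRegression regression = new SimpleRegression();
            for (double[] pair : scores) {
                regression.addData(pair[0], pair[1]);   // x = judge 1, y = judge 2
            }

            System.out.println("Slope:     " + regression.getSlope());
            System.out.println("Intercept: " + regression.getIntercept());
            // Predict judge 2's score when judge 1 awards a 6.
            System.out.println("Predicted: " + regression.predict(6));
        }
    }

With the real judge data loaded, the slope and intercept would correspond to the trendline equation from Section 3.2.3.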
When using Java, the Apache Commons Math library has an implementation of simple linear regression. The process is straightforward. The first step is to load the text file and add each comma pair into a collection. Using the addData method, the double values for both scores are passed in; the string to primitive double data type conversion happens during this step.

3.4 Randomness
It's not always essential for you to have data at hand to do any work. Random numbers can bring up some interesting experiments and code. In this section, we're going to look at two aspects of using random numbers. First we'll look at finding Pi using some basic maths and Monte Carlo methods; second we'll look at random walks.
Using random numbers to find the value of Pi. The Monte Carlo method is the concept of emulating a random process. When the process is repeated many times, it gives rise to an approximation of some mathematical quantity of interest. So, in theory, with enough random darts thrown at a circle, you should be able to find the value of Pi (see Figure 3.4, Figure 3.5 & Figure 3.6).
Figure 3.4: Empty square
Figure 3.5: Draw a circle within the square
Figure 3.6: Placing enough random data in the square will give you darts that are in the square, and some of them will be within the circle. These are the darts that we're really interested in.
These are random throws. You might throw 10 times; you might throw 1 million times. At the end of the dart throws, you count the number of darts within the circle, divide that by the total number of throws, and then multiply the result by 4:
Pi ≈ 4 × (number of darts inside the circle ÷ total number of darts thrown)
The more throws we do, the better the chance of finding a number near Pi. This is the law of large numbers at work. It's a classic computer science problem which can be solved and implemented in Java or Python.

3.5 Summary
Mathematics underpins everything that is done within ML. This chapter acts as a reminder of some of the basic summary statistics, building on this knowledge to produce techniques like linear regression, standard deviation, and Monte Carlo methods.

3.6 Review Questions
1. How can one create a stream to read the content of a file?
2. Briefly describe the difference between the arithmetic mean, variance and harmonic mean.
3. What is the difference between the SD and the median?
4. Describe the function of Linear Regression in ML.
5. How can one compute the value of Pi using random numbers?
Read Recommended Reading:
Bell, J., 2020. Machine learning: hands-on for developers and technical professionals. John Wiley & Sons.
Esposito, D. and Esposito, F., 2020. Introducing Machine Learning. Microsoft Press.

3.7 MCQs (Quick Quiz)
6. What is the arithmetic mean of 4, 3, 7, and 2? a) 2 b) 3 c) 4 d) 16 e) 10
7. What is the range of 4, 3, 7, 3, 2 and 12? a) 9 b) 8 c) 10 d) 6
8. Which statistical technique do we use to show where the data lies relative to the slope? a) Scatter plot b) SD c) Range d) Trendline
9. What is Linear Regression?
a) AI method b) ML method c) Statistical method d) Data visualisation method P a g e | 37 Chapter 4: Decision Trees & Clustering LEARNING OUTCOMES After reading this Section of the guide, the learner should be able to: Understand the concept of Decision Trees & Clustering Understand different Clustering Models Understand Decision Trees algorithm types Understand Entropy calculation Understand Cross Validation Methods & Data Visualization 4.1 Introduction to Basics Decision Trees Think about how you select different options within an automated telephone call. The options are essentially decisions that are being made for you to get to the desired department. These decision trees are used effectively in many industry areas. Financial institutions use decision trees. One of the fundamental use cases is in option pricing, where a binary-like decision tree is used to predict the price of an option in either a bull or bear market. Marketers use decision trees to establish customers by type and predict whether a customer will buy a specific type of product. In the medical field, decision tree models have been designed to diagnose blood infections or even predict heart attack outcomes in chest pain patients. Variables in the decision tree include diagnosis, treatment, and patient data. The gaming industry now uses multiple decision trees in movement recognition and facial recognition. The Microsoft Kinect platform uses this method to track body movement. The Kinect team used one million images and trained three trees. Within one day and using a 1,000-core cluster, the decision trees were classifying specific body parts across the screen. P a g e | 38 4.1.1 Why using Decision Trees? There are some good reasons to use decision trees. For one thing, they are easy to read. After a model is generated, it's easy to report to others regarding how the tree works. Also, with decision trees you can handle numerical or categorized information. Later, this chapter demonstrates how to manually work through an algorithm with category values; the example walk-through uses numerical data. In terms of data preparation, there's little to do. As long as the data is formalized in something like comma-separated variables, then you can create a working model. This also makes it easy to validate the model using various tests. With decision trees you use white-box testing—meaning the internal workings can be observed but not changed; you can view the steps that are being used when the tree is being modelled. Decision trees perform well with reasonable amounts of computing power. If you have a large set of data, then decision tree learning will handle it well. 4.1.2 Disadvantages of using Decision Trees One of the main issues of decision trees is that they can create overly complex models, depending on the data presented in the training set. To avoid the ML algorithm's over-fitting the data, it's sometimes worth reviewing the training data and pruning the values to categories, which will produce a more refined and better-tuned model. Some of the decision tree concepts can be hard to learn because the model cannot express them easily. This shortcoming sometimes results in a larger-than-normal model. You might be required to change the model or look at different methods of ML. 4.1.3 Decision Trees Types ID3. The ID3 (Iterative Dichotomiser 3) algorithm was invented by Ross Quinlan to create trees from datasets. 
By calculating the entropy for every attribute in the dataset, this could be split into subsets based on the minimum entropy value. After the set had a decision tree node created, all that was required was to recursively go through the remaining attributes in the set. ID3 uses the method of information gain—the measure of difference in entropy before and after an attribute is split—to decide on the root node (the node with the highest information gain). ID3 suffered from over-fitting on training data, and the algorithm was P a g e | 39 better suited to smaller trees than large ones. The ID3 algorithm is used less these days in favour of the C4.5 algorithm, which is outlined next. C4.5. It's also based on the information gain method, but it enables the trees to be used for classification. This is a widely used algorithm in that many users run in Weka with the open source Java version of C4.5, the J48 algorithm. There are notable improvements in C4.5 over the original ID3 algorithm. With the ability to work on continuous attributes, the C4.5 method will calculate a threshold point for the split to occur. For example, with a list of values like the following: C4.5 has the ability to work despite missing attribute values. The missing values are marked with a question mark (?). The gain and entropy calculations are simply skipped when there is no data available. CHAID: The CHAID (Chi-squared Automatic Interaction Detection) technique was developed by Gordon V. Kass in 1980. The main use of it was within marketing, but it was also used within medical and psychiatric research. MARS. For numerical data, it might be worth investigating the MARS (multivariate adaptive regression splines) algorithm. You might see this as an open source alternative called “Earth,” as MARS is trademarked by Salford Systems. 4.1.4 The intuition behind Decision Trees Every tree is comprised of nodes. Each node is associated with one of the input variables. The edges coming from that node are the total possible values of that node. A leaf represents the value based on the values given from the input variable in the path running from the root node to the leaf (see Figure 4.1). P a g e | 40 Figure 4.1: Decision Trees Decision trees always start with a root node and end on a leaf. Notice that the trees don't converge at any point; they split their way out as the nodes are processed. Figure 4.1 shows a decision tree that classifies a loan decision. The root node is “Age” and has two branches that come from it, whether the customer is younger or older than 55. The age of the client determines what happens next. If the person is younger than 55, then the tree prompts you to find out if he or she is a student. If the client is older than 55, then you are prompted to check his or her credit rating. With this type of ML, you are using supervised learning to deduce the optimal method to make a prediction; what we mean by “supervised learning” is that you give the classifier data with the outcomes. The real question is, “What's the best node to start with as the root node?” Check the model for the base cases. Iterate through all the attributes (attr). Get the normalized information gain from splitting on attr. Let best_attr be the attribute with the highest information gain. Create a decision node that splits on the best_attr attribute. Work on the sublists that are obtained by splitting on best_attr and add those nodes as child nodes. P a g e | 41 That's the basic outline of what happens when you build a decision tree. 
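As a minimal illustration of that outline (this is an assumption-laden sketch, not the ID3/C4.5 or Weka J48 implementation mentioned earlier; the counts in main are invented), the entropy and information-gain helpers that drive the choice of node could look like this in Java:

    import java.util.Arrays;
    import java.util.List;

    public class Id3Sketch {
        // Entropy of a boolean target: -p*log2(p) - q*log2(q).
        static double entropy(int positives, int negatives) {
            int total = positives + negatives;
            if (total == 0 || positives == 0 || negatives == 0) return 0.0;
            double p = (double) positives / total;
            double q = (double) negatives / total;
            return -(p * log2(p)) - (q * log2(q));
        }

        static double log2(double x) {
            return Math.log(x) / Math.log(2);
        }

        // Information gain = entropy before the split minus the weighted
        // entropy of the subsets produced by splitting on an attribute.
        static double informationGain(int pos, int neg, List<int[]> subsets) {
            double before = entropy(pos, neg);
            double after = 0.0;
            int total = pos + neg;
            for (int[] subset : subsets) {            // subset = {positives, negatives}
                int size = subset[0] + subset[1];
                after += ((double) size / total) * entropy(subset[0], subset[1]);
            }
            return before - after;
        }

        public static void main(String[] args) {
            // Invented example: 6 "yes" and 4 "no" outcomes before the split,
            // and an attribute that splits them into two subsets.
            List<int[]> subsets = Arrays.asList(new int[]{5, 1}, new int[]{1, 3});
            System.out.println("Gain: " + informationGain(6, 4, subsets));
        }
    }

A real learner would wrap this in a recursive method that computes the gain for every attribute, splits on the attribute with the highest gain, creates a node for it, and calls itself on each resulting subset, exactly as the outline above describes. Entropy and information gain themselves are explained in the next section.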
Depending on the algorithm type, like the ones previously mentioned, there might be subtle differences in the way things are done.

4.2 Entropy Computation
Entropy is a measure of uncertainty. It is measured in bits and, for an attribute with two outcomes, comes out as a number between zero and 1 (entropy bits are not the same bits as used in computing terminology). Basically, you are looking for the unpredictability in a random variable. There are two entropy values to work out: the first is the entropy of a single attribute, and the second is the entropy across two attributes.

For this working example, we will use Has Credit Account as our target attribute. By the end of the exercise, we will know the next node to use in the decision tree. The Has Credit Account attribute has two outcomes (Yes or No). (See Table 4.1.)

Table 4.1: Original table

The entropy calculation for the target attribute looks as follows:

Entropy(S) = -p(Yes) log2 p(Yes) - p(No) log2 p(No)

For the data in Table 4.1 this works out to approximately 0.97095 bits.

4.2.1 Information Gain Computation
When you know the entropy before the split (0.97095) and the weighted entropy after the split (0.5884), you can calculate the information gain. With the attribute that indicates whether the customer has a credit account, the calculation is the following:

Gain(S, Has Credit Account) = 0.97095 - 0.5884 = 0.38255

4.3 Clustering
Do not confuse ML clustering with clusters of machines in a networking sense. If you search for all the definitions of clustering out there, you get "organizing a group of objects that share similar characteristics." It is classed as an unsupervised learning method, which means there is no prior training data from which to learn.

In Figure 4.2 you can see there are three distinct groupings of data; each one of those groups is a cluster. The main aim is to find structure within a given set of data. Because there are a lot of algorithms to choose from, clustering casts a wide net. This is where experimentation comes in handy: which algorithm is the right choice? Sometimes you just need to put some code together and play with it.

Figure 4.2: Clustering

4.3.1 Why Use Clustering?
Clustering is a widely used ML approach. Although it might seem simple, do not underestimate the importance of grouping multivariate data into refined groupings. Table 4.2 shows the applicability of clustering in real life.

Table 4.2: Clustering in real life
Domain | Description
The Internet | To determine the community of users
Business and Retail | Customers can be grouped based on various attributes (location, items, gender, etc.)
Law Enforcement | To predict when and where crimes will occur
Computing | Motion detection and remote sensing

4.3.2 Clustering Models
Graph models can show cluster-like properties when the nodes start appearing as small subsets connected by one main edge (see Figure 4.3). You can also approach simple clustering with groups, in the same way you group records in a structured query language.

Figure 4.3: Clustering Models (Nodes & Edges)

4.3.3 K-Means Algorithms
If you have a group of objects, the idea of the k-means algorithm is to partition them into a number of clusters. What's important is that it's up to you to define how many clusters you want. For example, say we have 1,000 objects and we want to find four clusters. Each one of the clusters has a centroid (sometimes called the mean, hence the name k-means), a point from which the distance of each object is calculated. The clusters are defined by an iterative process on the distances of the objects, to calculate which objects are nearest to each centroid. This is all done unsupervised; you just have to let the algorithm do its processing and inspect the results.
After the iterations have reached the point where the objects no longer move to different centroids, the k-means clustering is assumed to be complete. The process works as follows.

First, the algorithm must be initialized by assigning a cluster to every observation. The random partition method assigns each observation to a cluster at random, which tends to place the initial cluster centers toward the center of the dataset. Another initialization method is the Forgy method, which picks random observations as the initial centers and so spreads out the initial locations of the clusters. After the initial cluster assignments are made, the algorithm alternates between an assignment step and an update step.

Assignment. To decide which cluster an observed object belongs to, the algorithm uses a Euclidean distance measurement. The squared Euclidean distance from the object to each cluster centroid is calculated, and the object is assigned to the cluster whose centroid gives the smallest value.

Update. Each centroid is then recalculated as the mean of the objects currently assigned to it, and the assignment step is repeated until the assignments no longer change.

Calculating the Euclidean distance is quite simple and requires only some entry-level math; if you can remember how to apply Pythagoras' theorem, then you are already there. Assume a basic grid with positions on the X-axis (horizontal) and the Y-axis (vertical). The center point of our cluster is currently at (1, 6), and the object is located at (3, 1) (see Figure 4.4). The distance is 3 - 1 = 2 along the horizontal axis and 6 - 1 = 5 along the vertical axis. Using Pythagoras' theorem, the squared distance is:

distance^2 = 2^2 + 5^2 = 4 + 25 = 29

so the Euclidean distance is the square root of 29, roughly 5.39.

Figure 4.4: Cluster Centroid

4.4 Cross Validation
By splitting the dataset into separate partitions, you can run the clustering analysis on one partition and then validate it against the remaining partitions. By averaging the sum-of-squares results across the partitions, you can determine the number of clusters to use.

4.4.1 Silhouette Method
Peter J. Rousseeuw first described the silhouette method in 1986. It is a way of validating where objects lie within a cluster. For any object, you can calculate how similar it is to the other objects within the same cluster. By calculating an object's average dissimilarity to the objects in its own cluster and comparing that with its average dissimilarity to the objects in the other clusters, you can determine an average silhouette score. Values close to 1 indicate that an object sits comfortably within its cluster, while values near zero or below suggest it lies between clusters. When the average silhouettes of the clusters are compared, they are expected to be similar; when some silhouettes are very narrow and others are large, it might point to the fact that not enough clusters were defined when the computation process began.

4.4.2 Data Visualization
The last thing to do is look at the visualization of the clustering. Each cluster has its own colour scheme, and the plot shows the values (see Figure 4.5).

Figure 4.5: Data visualization
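Putting the chapter's clustering steps together, here is a minimal sketch, assuming scikit-learn and matplotlib are installed. The synthetic blob data and the choice of four clusters are illustrative assumptions, not part of the guide's worked example.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Generate some illustrative two-dimensional data with four natural groupings.
X, _ = make_blobs(n_samples=1000, centers=4, random_state=42)

# Fit k-means with k=4; the centroids are refined by the assignment/update loop.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Silhouette score: values close to 1 suggest compact, well-separated clusters.
print("Silhouette score:", silhouette_score(X, labels))

# Visualise the result: one colour per cluster, centroids marked with crosses.
plt.scatter(X[:, 0], X[:, 1], c=labels, s=10)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker="x", color="black", s=100)
plt.title("k-means clustering with k=4")
plt.show()
```

Trying different values of k and comparing the silhouette scores is one simple way to apply the validation ideas discussed above.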
4.5 Summary
This chapter has shown how decision trees work and the different algorithm types that are available. Although many people perceive decision trees as simple, do not underestimate their uses. They are often useful regardless of whether you have category or numerical data. Clustering will be one of those machine learning techniques that you'll pull out again and again. To that end, it does need some thought before you go building clusters and seeing what happens. The chapter has discussed clustering with simple k-means clusters. Obviously, there are plenty of options from this point on, but with what you've read in this chapter, you'll be able to get a system up and working quickly.

4.6 Review Questions
6. Describe the intuition behind the silhouette method.
7. Briefly describe the working process of clustering.
8. List three advantages of using Decision Trees.
9. Describe the limitations of Decision Trees.
10. What is the K-Means algorithm?

Recommended Reading:
Bell, J., 2020. Machine Learning: Hands-On for Developers and Technical Professionals. John Wiley & Sons.
Esposito, D. and Esposito, F., 2020. Introducing Machine Learning. Microsoft Press.

4.7 MCQs (Quick Quiz)
6. Which ML algorithm uses the concept of Euclidean distance?
a) Decision Trees
b) K-Means
c) Supervised algorithms
d) Unsupervised algorithms
e) Neural Networks
7. Which one best describes Cross Validation?
a) A manner of evaluating ML models
b) Data visualisation process
c) Data cleaning
d) None of the above
8. How can one create a clustering model?
a) Using nodes and graphs
b) Using Decision Trees
c) Using groups & graphs
d) a & b
9. Which concepts are important for Decision Trees?
a) Information
b) Data
c) Entropy & Information Gain
d) None of the above
10. Which one is not a type of Decision Tree algorithm?
a) J48 algorithm
b) ID3 algorithm
c) C4.5 algorithm
d) None of the above

Chapter 5: Association Rules Learning, Support Vector Machines & Neural Networks

LEARNING OUTCOMES
After reading this section of the guide, the learner should be able to:
Understand the concept of Association Rules learning
Understand Support Vector Machines & their applicability
Understand Linear Classifiers
Understand the concept of Non-Linear Classification
Understand the concept of the Neural Network algorithm & its applicability

Among the ML methods available, association rules learning is probably one of the most widely used. From point-of-sale systems to web page usage mining, this method is frequently employed to examine transactions. It uncovers interesting connections among elements of the data and the sequences of behaviour that led to some correlated result. This chapter describes association rules learning and also works through the Support Vector Machine (SVM) and Neural Network (NN) algorithms.

5.1 Association Rules
The retail industry is tripping over itself to give you, the customer, offers on merchandise it thinks you will buy. To do that, though, it needs to know what you've bought previously and what other customers, similar to you, have bought. Brands such as Tesco and Target thrive on basket analysis to see what you've purchased previously. If you think the amount of content that Twitter produces is big, then just think about point-of-sale data; it's another world. Some supermarkets fail to adopt this technology and never look into baskets, much to their competitive disadvantage. If you can analyse baskets and act on the results, then you can see how to increase bottom-line revenue.

Association rules learning isn't only for retail and supermarkets, though. In the field of web analytics, it is used to track, learn, and predict user behaviour on websites. Huge amounts of biological data are also being mined to gain knowledge: bioinformatics uses association rules learning for protein and gene sequencing, working on a smaller scale than something like computational biology and homing in on specifics such as studies of genome mutations.
5.1.1 Web Mining
Knowing which pages a user is looking at and then suggesting other pages that might be of interest is a commonplace way to keep a website compelling and "sticky." For this type of mining, you require a mechanism for knowing which user is looking at which pages; the user could be identified by a user session, a cookie ID, or a previous user login where the site requires users to log in to see the information.

If you have access to your website log files, then there is an opportunity for you to mine the information. Many companies use the likes of Google Analytics because it saves them mining the logs themselves, but it's worthwhile doing your own analysis if you can. The basic log file has information against which you could run some basic association rules learning. Looking at the Apache Common Log Format (CLF), you can see the IP address of the request and the file it was trying to access. By extracting the URL and the IP address, the association rules could eventually suggest related content on your site that would be of interest to the user.

There are several algorithms used in association rule learning that you'll come across; the two described in this section are the most prevalent. A small sketch of the Apriori counting idea follows the descriptions.

Apriori algorithm. Using a bottom-up approach, the Apriori algorithm works through item sets one at a time. Candidate groups are tested against the data; when no extensions to the set are found, the algorithm stops. The support threshold for the example in Figure 5.1 is 3.

Figure 5.1: Apriori algorithm

As {1,2}, {1,3}, {1,4}, and {2,3} are under the chosen support threshold, you can reject them and discount the triples in the database that contain them. In the example, that leaves only one candidate triple, {2,3,4}, with a count of 1. From that deduction, you have the frequent item sets.

FP-Growth algorithm. The Frequent Pattern Growth (FP-Growth) algorithm works as a tree structure (called an FP-Tree). It creates the tree by counting the occurrences of the items in the database and storing them in a header table. In a second pass, the tree is built by inserting the instances it sees in the data as it works along the header table. Items that don't meet the minimum support threshold are discarded; otherwise, they are listed in descending order. You can think of the FP-Growth algorithm like a graph. With a reduced dataset in a tree formation, the FP-Growth algorithm starts at the bottom (the place with the longest branches) and finds all instances of the given condition. When no more single items match the attribute's support threshold, the growth ends, and then it works on the next part of the FP-Tree.
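To make the counting idea behind the Apriori pass concrete, here is a minimal sketch using only the Python standard library. The transactions and item numbering are illustrative assumptions, only loosely modelled on the Figure 5.1 example (which is not reproduced here), and the candidate generation is simplified rather than a full Apriori join-and-prune implementation.

```python
from itertools import combinations

# Hypothetical transactions; items are numbered as in the chapter's example.
transactions = [
    {1, 2}, {1, 3}, {1, 4},
    {2, 4}, {2, 4},
    {3, 4}, {3, 4},
    {2, 3, 4},
]
min_support = 3  # an item set must appear in at least 3 transactions

def support(itemset):
    """Count how many transactions contain every item in the candidate set."""
    return sum(1 for t in transactions if itemset <= t)

# Bottom-up pass: keep growing item sets while they still meet the threshold.
items = sorted({item for t in transactions for item in t})
frequent = {}
size = 1
current = [frozenset(c) for c in combinations(items, size)]
while current:
    kept = {c: support(c) for c in current if support(c) >= min_support}
    frequent.update(kept)
    # Candidates of the next size are built only from surviving items (Apriori property).
    survivors = sorted({item for c in kept for item in c})
    size += 1
    current = [frozenset(c) for c in combinations(survivors, size)]

for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), "support =", count)
```

With this data, the pairs {1,2}, {1,3}, {1,4}, and {2,3} fall below the threshold, and the lone candidate triple {2,3,4} is also discarded, mirroring the narrative above. Real implementations add rule generation (confidence and lift measures) on top of this counting step.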
5.2 Support Vector Machines
A support vector machine (SVM) is essentially a technique for classifying objects. It's a supervised learning method, so the usual route for getting an SVM set up is to have some training data and some data to test the algorithm with. With an SVM, the classification is either linear (it's either this object or that object) or nonlinear. There is a lot of comparison between SVMs and artificial neural networks (ANNs), especially as some methods of finding minimum errors, and the sigmoid function, are used in both. It's easy to imagine an SVM as a two- or three-dimensional plot with each object located within it. Essentially, every object is a point in that space, and if there's sufficient distance between the groups, then the process of classifying is easy enough.

5.3 Why Use SVMs?
SVMs are used in a variety of classification scenarios, such as image recognition and handwriting pattern recognition. Image classification can be greatly improved with the use of SVMs; being able to classify thousands or millions of images is becoming more and more important with the use of smartphones and applications like Instagram. SVMs can also do text classification on normal text or web documents, for instance.

Medical science has long used SVMs for protein classification. The National Institutes of Health has even developed an SVM protein software library, a web-based tool that classifies a protein into its functional family.

Some people criticize the SVM because it can be difficult to understand, unless you are blessed with a good mathematician who can guide and explain what is going on. In some cases you are left with a black-box implementation of an SVM that takes in input data and produces output data, with little insight into what happens in between. ML with SVMs takes the concept of a perceptron a little further, to maximize the geometric margin; it's one of the reasons why SVMs and ANNs are frequently compared in function and performance.

5.3.1 SVM Classification Principle
Consider a basic classification problem. You want to figure out which objects are squares and which are circles. These squares and circles could represent anything you want: cats and dogs, humans and aliens, or something else. An illustration of the two sets of objects is provided in Figure 5.2.

Figure 5.2: An illustration of binary classification

This task would be considered a binary classification problem, because there are only two outcomes; it's either one object or the other. Think of it as a 0 or a 1. With some supervised learning, you could figure out pretty quickly where those classes lie with a reasonable amount of confidence. What about when there are more than two classes? For example, you could add triangles to the mix, as shown in Figure 5.3.

Figure 5.3: Multiple classes to be classified

Binary classification isn't going to work here. You are now presented with a multiclass classification problem. Because there are more than two classes, you have to use an algorithm that can classify them accordingly. It's worth noting, though, that some multiclass methods use pair-wise combinations of binary classifiers to arrive at a prediction.

Linear classifier. To determine which group an object belongs to, you use a linear classifier to establish the locations of the objects and see whether there's a neat dividing line. This line is called a hyperplane; with it in place, there should be a group of objects clearly on one side of the line and another group just as clearly on the opposite side. (That's the theory, anyway; assume that all your ducks are in a row, or rather in two separate groups.) As shown in Figure 5.4, visually it looks straightforward, but you need to compute it mathematically. Every object that you classify is called a point, and every point has a set of features.

Figure 5.4: SVM hyperplane - Linear classification

For each point in the graph, you know there is an x-axis value and a y-axis value. The classification of a point is calculated as follows:

sign(ax + by + c)

The values for a, b, and c are the values that define the line; these values are ones that you choose, and you'll need to tweak them along the way until you get a good fit (clear separation).
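As a literal, illustrative translation of this decision rule into Python (the coefficient values below are made up for the sketch, not fitted from any data):

```python
def classify(x, y, a=1.0, b=-2.0, c=0.5):
    """Linear decision rule: which side of the line ax + by + c = 0 is the point on?"""
    score = a * x + b * y + c
    return +1 if score > 0 else -1   # +1 for one class, -1 for the other

# Two hypothetical points that fall on opposite sides of the line.
print(classify(3.0, 1.0))   # prints +1
print(classify(0.0, 2.0))   # prints -1
```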
What you are interested in, though, is the result; you want a function that returns +1 if the result is positive, signifying that the point is in one category, and returns -1 when the point is in the other category. The function's resulting value (+1 or -1) must be correct for every point that you're trying to classify. Don't forget that you have a training file with the correctly classified data so that you can judge the function's correctness; this approach is a supervised method of learning. This step has to be done to figure out where the line fits. Points that are further away from the line show more confidence that they belong to a specific class (see Figure 5.5).

Figure 5.5: SVM Max margin Hyperplane

The objects that lie on the edge hyperplanes are the support vectors (see Figure 5.6).

Figure 5.6: Support Vectors

Non-linear classification. In an ideal world, the objects would lie on one side of the hyperplane or the other. Life, unfortunately, is rarely like that. Instead, you see objects straying from the hyperplane (see Figure 5.7).

Figure 5.7: Objects do not go where you want them to go

By applying a kernel function (sometimes referred to as the "kernel trick"), you can apply an algorithm to fit the hyperplane's maximum margin in a feature space. The method is similar to the dot products used in the linear methods, but it replaces the dot product with a kernel function. There are a few kernel types to choose from: the hyperbolic tangent, the Gaussian radial basis function (RBF), and two polynomial functions, one homogeneous and the other inhomogeneous.

5.4 Neural Networks
In biological terms, a neuron is a cell that can transmit and process chemical or electrical signals. A neuron is connected with other neurons to create a network; picture the notion of graph theory with nodes and edges, and you're picturing a neural network. Within humans there are a huge number of neurons interconnected with each other: tens of billions of interconnected structures. Every neuron has an input (called the dendrite), a cell body, and an output (called the axon), as shown in Figure 5.8.

Figure 5.8: Neurone Structure

Outputs connect to the inputs of other neurons, and the network develops. Biologically, neurons can have 10,000 different inputs, and their complexity is much greater than that of the artificial ones we are talking about here. Neurons are activated when an electrochemical signal reaches the cell. The cell body determines the weight of the signal, and if a threshold is passed, the firing continues through the output, along the axon.

What Are Neural Networks?
Artificial neural networks are essentially modelled on the parallel architecture of animal brains, not necessarily human ones. The network is based on a simple form of inputs and outputs. (See Table 5.1 for applications of neural networks in real life.)

Table 5.1: Neural Networks in real life
Domain | Description
High-Frequency Trading | Thousands of transactions can be done in the time it takes a human to make one
Credit Applications | To spot good and bad credit factors
Data Center Management | With incoming data on loads, operating temperatures, network equipment usage, and outside air temperatures, Google can calculate the efficiency of the data center and adjust the settings on monitoring and cooling equipment
Robotics | Pattern recognition, some of which requires huge amounts of sensor data
Medical Monitoring | Constant updating of many variables, such as heart rate, blood pressure, and so on

Perceptrons. The basis for a neural network is the perceptron. Its role is quite simple: it receives an input signal, passes the value through some form of function, and outputs the result of the function (see Figure 5.9).

Figure 5.9: Perceptrons

Activation Function. The activation function is the processing that happens after the input is passed into the neuron. The result of this function determines whether the value is passed to the output axon and on to the next neuron in the network. Commonly, the sigmoid function (see Figure 5.10) and the hyperbolic tangent are used as activation functions to calculate the output. The sigmoid function outputs a value between 0 and 1.

Figure 5.10: Sigmoid Function (Sigmoid because of the S form)

Multilayer Perceptrons. The problem with single-layer perceptrons is that they can only solve problems that are linearly separable; the output is either one value or another. Multilayer perceptrons have one or more layers between the input nodes and the eventual output nodes. These middle layers, between the input and the output, are called hidden layers (see Figure 5.11). The question is, what happens in the hidden layer? The input values are fed to the hidden layer, and what it receives depends on the output of the input layer.

Figure 5.11: A neural network with one input, one hidden and one output layer

This is where the neural network becomes useful. You can use the network for classification and pattern recognition, but it does require training. You can train an artificial neural network by unsupervised or supervised means. The issue is that you don't know what the weight values should be for the hidden layer. By changing the weights and the bias feeding the sigmoid function, you can vary the output; an error function can then be applied, and the aim is to get the value of the error function to a minimum. For that, you need something that is continuous and differentiable. With the bias implemented in the sigmoid function, each run of the network refines the output and the error function, which leads to a better-trained network and more reliable answers.

Back propagation. Within the multilayer perceptron is the concept of back propagation, short for the "backward propagation of errors." Back propagation calculates the gradients that map the inputs to the correct outputs. There are two steps to back propagation, the propagation phase and the updating of the weights, and they occur for all the neurons in the network. In the propagation phase, the training input is fed through the network and generates the activations of the output. The output activations are then propagated backwards, generating deltas for all the output and hidden layers of the network based on the target of the training pattern. In the second phase, the weight update is calculated by multiplying the output delta by the input activation; this gives you the gradient of the weight. A percentage of the gradient is then subtracted from the weight, and this is done for all the weighted connections in the network. The percentage is called the learning rate: the higher the rate, the faster the learning, while a lower rate tends to make the learning more accurate.
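The perceptron, sigmoid activation, error function, learning rate, and back-propagation steps described above can be seen working together in a small toy sketch. This is an assumption-laden illustration, not a production training loop: the 2-2-1 architecture, the XOR problem, the learning rate, and the epoch count are arbitrary choices, and convergence depends on the random initialisation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR problem: not linearly separable, so a hidden layer is required.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 2))   # input -> hidden weights
b1 = np.zeros((1, 2))
W2 = rng.normal(size=(2, 1))   # hidden -> output weights
b2 = np.zeros((1, 1))

learning_rate = 0.5
for epoch in range(20000):
    # Forward pass (propagation phase)
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)

    # Backward pass: deltas for the output and hidden layers
    error = output - y                                   # derivative of the squared error
    delta_out = error * output * (1 - output)            # times the sigmoid derivative
    delta_hidden = (delta_out @ W2.T) * hidden * (1 - hidden)

    # Weight update: gradient = input activation x delta, scaled by the learning rate
    W2 -= learning_rate * hidden.T @ delta_out
    b2 -= learning_rate * delta_out.sum(axis=0, keepdims=True)
    W1 -= learning_rate * X.T @ delta_hidden
    b1 -= learning_rate * delta_hidden.sum(axis=0, keepdims=True)

print(np.round(output, 3))   # typically approaches [0, 1, 1, 0] after training
```

Raising the learning rate makes the updates larger and the training faster, but at the risk of overshooting the minimum of the error function, which is exactly the trade-off described above.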
5.4.1 Data Preparation for ANNs
When creating an ANN, it's worth using a supervised learning method. However, this requires some thought about the data that you are going to use to train the network. ANNs work only with numerical data values, so if there are attributes with text values, they need to be converted. This isn't so much an issue with the likes of gender, where the common encoding would be Male = 0 and Female = 1, for example. Raw text wouldn't be suitable, so it will either need to be tidied up, hashed to numeric values, or removed from the test data. As with all data strategies, it's a case of thinking about what's important and what data you can live without.

As the number of variables in your classification data increases, you will come across the phenomenon called "the curse of dimensionality": each added variable increases the total volume of training data required to get reasonable results and insight. So, when you are thinking of adding another variable, make sure you have enough training data to cover the eventualities across all the other variables.

Although neural networks are fairly tolerant of noisy data, it's worth trying to ensure that there aren't large outliers that could potentially cause issues with the results. Either find and remove the wayward values or turn them into missing values.

5.5 Summary
This chapter covers the core concepts of how association rules, support vector machines, and neural networks actually work. It's better to have a model that's explainable than not. Care must be taken going forward, especially with live customer data. While it's a computer doing the work and making the predictions, it's still our responsibility to make sure the predictions are fair, are correct, and do not negatively impact another party.

5.6 Review Questions
1. What is association rules learning?
2. Briefly describe the difference between the Apriori and FP-Growth algorithms.
3. How does an SVM compare to a Neural Network?
4. Describe the role of the sigmoid function.
5. Briefly describe the difference between Linear and Non-Linear classification.
6. Besides CSV, JSON, and XML formats, could you list three additional dataset formats used in the ML community?

Recommended Reading:
Bell, J., 2020. Machine Learning: Hands-On for Developers and Technical Professionals. John Wiley & Sons.
Esposito, D. and Esposito, F., 2020. Introducing Machine Learning. Microsoft Press.

5.7 MCQs (Quick Quiz)
6. Which ML algorithm is an example of association rules?