Machine Learning Lab Manual (University of Science and Technology)

Summary

This document is a lab manual for an introductory machine learning course at the University of Science and Technology. It covers various Python libraries and techniques for data analysis and modeling. The document contains exercises, examples, and explanations on using Python libraries such as Pandas, Numpy, and Scikit-learn in the context of machine learning.

Full Transcript

**University of Science and Technology**

**Faculty of Computer Science and Information Technology**

**Introduction to Machine Learning**

**Lab Manual**

**Prepared By**

**Prof. Noureldien A. Noureldien**

**October 2024**

**Table of Contents**

  **Week \#**   **Lab Topics**
  ------------- -----------------------------------------------------------------------------
  1             Python Built-in Functions and the Math Module
  2             NumPy Module
  3             CSV Files
  4             Pandas Package
  5             Cleaning of Data Using Pandas
  6             Understand and Visualize Your Data
  7             Data Preprocessing and Machine Learning Modeling Using Scikit-learn Library
  8             Dataset Feature Selection Techniques
  9             Building Supervised Learning Classification Model
  10            Building Learning Regression Models

**Lab (1):** **Python Built-in Functions and the Math Module**

**Labs Course Objectives:**

The objectives of the machine learning lab course are:

- To introduce the basic concepts and techniques of Machine Learning and the need for Machine Learning techniques in real-world problems.
- To provide an understanding of various Machine Learning algorithms and the way to evaluate their performance.
- To apply Machine Learning to learn, predict and classify real-world problems in the Supervised Learning paradigm, and to explore the Unsupervised Learning paradigm of Machine Learning.
- To inculcate in students a professional and ethical attitude, a multidisciplinary approach, and an ability to relate real-world issues and provide cost-effective solutions to them by developing ML applications.

**Labs Course Outcomes:**

Upon successful completion of this lab course, the students will be able to:

- Understand the basic concepts and techniques of Machine Learning and the need for Machine Learning techniques in real-world problems.
- Understand various Machine Learning algorithms and the way to evaluate their performance.
- Apply Machine Learning to learn, predict and classify real-world problems in the Supervised Learning paradigm, and explore the Unsupervised Learning paradigm of Machine Learning.
- Understand, learn and design Artificial Neural Networks of Supervised Learning for the selected problems.
- Understand the concepts of Reinforcement Learning and Ensemble Methods.

**Lab (1): Learning Outcomes**

By the end of this lab the student must be able to:

- Use Python built-in functions.
- Write Python code that uses different Math module functions.

**Lab (1): What Instructor has to do**

The instructor has to:

- Follow the steps below to demonstrate to the students how to use Python built-in functions and the Math module.
- Supervise students while they write and run the Python code in this lab.

**1.1 Python Libraries**

There are many reasons why Python is popular among developers, and one of them is its large collection of libraries. In this lab, we discuss the Python Standard Library and other libraries available for the Python programming language, such as SciPy and NumPy.

A module is a file containing Python code, and a package is a directory of sub-packages and modules. A Python library is a reusable chunk of code that you may want to include in your programs or projects. Here, a library loosely describes a collection of core modules; essentially, a library is a collection of modules. A package is a library that can be installed using a package manager such as pip; NumPy is one example.

**1.2. Python Standard Library**

The Python Standard Library is a collection of script modules accessible to a Python program. It simplifies programming and removes the need to rewrite commonly used commands. The modules can be used by 'calling/importing' them at the beginning of a script. Some of the most important Standard Library modules are: time, sys, csv, math, random, pip, os, statistics, tkinter, and socket.
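As a minimal sketch of what 'calling/importing' a module at the beginning of a script looks like, the snippet below imports three of the modules listed above and calls one function from each. The variable names and the sample values are illustrative assumptions, not part of the lab material.

```python
# Import standard library modules at the top of the script.
import math
import random
import statistics

# Sample values chosen only for illustration.
marks = [70, 85, 90, 65]

root = math.sqrt(16)               # square root from the math module
average = statistics.mean(marks)   # arithmetic mean from the statistics module

random.seed(0)                     # fix the seed so repeated runs behave the same
pick = random.choice(marks)        # one element chosen pseudo-randomly

print(root)
print(average)
print(pick)
```

No installation step is needed for these modules because they ship with the Python interpreter.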
To display a list of all available modules, use the following command in the Python console:

    >>> help('modules')

**List of important Python Libraries**

- Python Libraries for Data Collection
    - Beautiful Soup
    - Scrapy
    - Selenium
- Python Libraries for Data Cleaning and Manipulation
    - Math
    - Pandas
    - PyOD
    - NumPy
    - Scipy
    - Spacy
- Python Libraries for Data Visualization
    - Matplotlib
    - Seaborn
    - Bokeh
- Python Libraries for Modeling
    - Scikit-learn
    - TensorFlow
    - Keras
    - PyTorch

**1.3. Python Basic Built-in Functions**

A function is a group of statements that performs a specific task. Python, like other programming languages, provides a library of functions. These are built-in functions, and they are always available in the Python interpreter; you don't have to import any modules to use them. For example, you can use the built-in functions abs, max, min, pow, and round, as shown in the following slide.

***Exercise 1: Simple Python Built-in Functions Example***

![](media/image2.png)

**1.4. Python Math Module Library**

The math module is a standard module in Python and is always available. To use the mathematical functions in this module, you have to import it using ***import math***. It gives access to the underlying C library functions. Many programs are created to solve mathematical problems, and some of the most popular mathematical functions are defined in the math module. These include trigonometric functions, representation functions, logarithmic functions, angle conversion functions, etc. In addition, two mathematical constants (pi and e) are also defined in this module.

![](media/image4.png)

***Exercise 2: Import Math Module to use math functions***

**Output**

![](media/image6.png)

***Exercise 3: Finding the factorial of the number***

Using the [factorial()](https://www.geeksforgeeks.org/python-math-factorial-function/) function, we can find the factorial of a number in a single line of code.
An error is raised if the number is negative or not an integer.

**Example:** This code imports the math module, assigns the value 5 to the variable a, and then calculates and prints the factorial of a.

    import math

    a = 5
    print("The factorial of 5 is : ", end="")
    print(math.factorial(a))

**Output:**

    The factorial of 5 is : 120

***Exercise 4: Finding the GCD***

The [gcd()](https://www.geeksforgeeks.org/python-math-gcd-function/) function is used to find the greatest common divisor of the two numbers passed as arguments.

**Example:** This code imports the math module, assigns the values 15 and 5 to the variables a and b, respectively, and then calculates and prints the greatest common divisor (GCD) of a and b.

    import math

    a = 15
    b = 5
    print("The gcd of 5 and 15 is : ", end="")
    print(math.gcd(b, a))

**Output:**

    The gcd of 5 and 15 is : 5

***Exercise 5: Finding the Logarithm***

- **log(a, b)** returns the logarithm of a to base b. If the base is not given, the natural logarithm (base e) is computed.
- **log2(a)** computes the logarithm of a to base 2. It is usually more accurate than log(a, 2).
- **log10(a)** computes the logarithm of a to base 10. It is usually more accurate than log(a, 10).

This code imports the math module and then calculates and prints the logarithms of three different numbers using log(), log2(), and log10().

    import math

    print("The value of log 2 with base 3 is : ", end="")
    print(math.log(2, 3))

    print("The value of log2 of 16 is : ", end="")
    print(math.log2(16))

    print("The value of log10 of 10000 is : ", end="")
    print(math.log10(10000))

**Output:**

    The value of log 2 with base 3 is : 0.6309297535714574
    The value of log2 of 16 is : 4.0
    The value of log10 of 10000 is : 4.0

***Exercise 6: Finding the Square root***

The [**sqrt()**](https://www.geeksforgeeks.org/python-math-function-sqrt/) function returns the square root of a number.

**Example:** This code imports the math module and then calculates and prints the square roots of three different numbers: 0, 4, and 3.5.

    import math

    print(math.sqrt(0))
    print(math.sqrt(4))
    print(math.sqrt(3.5))

**Output:**

    0.0
    2.0
    1.8708286933869707

***Exercise 7: Write a Python program to convert degrees to radians.***

The radian is the standard unit of angular measure, used in many areas of mathematics. An angle's measurement in radians is numerically equal to the length of the corresponding arc of a unit circle; one radian is just under 57.3 degrees (the angle at which the arc length is equal to the radius).
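The relationship described above (radians = degrees × π / 180) can be checked with the standard library. The sketch below is only an illustration, not the exercise's own solution: it uses math.pi and math.radians, whereas the exercise code approximates π as 22/7. The helper name deg_to_rad is hypothetical.

```python
import math

def deg_to_rad(degrees):
    """Convert degrees to radians using radians = degrees * pi / 180."""
    return degrees * math.pi / 180

angle = 15
manual = deg_to_rad(angle)        # formula written out by hand
builtin = math.radians(angle)     # standard library equivalent

print(manual, builtin)

# One radian expressed in degrees is just under 57.3.
print(math.degrees(1))
```

Because 22/7 is slightly larger than π, a result computed with 22/7 differs from math.radians(15) from about the fourth decimal place onward.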
Test Data:\
Degree : 15\
Expected Result in radians: 0.2619047619047619

**Python Code:**

    # pi is approximated as 22/7, which the expected result above relies on
    pi = 22 / 7
    degree = float(input("Input degrees: "))
    radian = degree * (pi / 180)
    print(radian)

**Output:**

    Input degrees: 90
    1.5714285714285714

***Exercise 8: Write a Python program to convert a binary number to decimal number***

    b_num = list(input("Input a binary number: "))
    value = 0
    for i in range(len(b_num)):
        # digits are taken from the right; the i-th digit has weight 2**i
        digit = b_num.pop()
        if digit == '1':
            value = value + pow(2, i)
    print("The decimal value of the number is", value)

**Output:**

    Input a binary number: 1000001
    The decimal value of the number is 65

**Lab (2):** **NumPy Module**

**Lab (2): Lab Description**

This lab describes the NumPy module, which is used to create and manipulate arrays in Python.

**Lab (2): Learning Outcomes**

By the end of this lab the student must be able to:

- Use Python NumPy module functions.
- Write Python code that uses the NumPy module to manipulate arrays.

**Lab (2): What Instructor has to do**

The instructor has to:

- Follow the steps below to demonstrate to the students how to use the Python NumPy module.
- Supervise students while they write and run the code in this lab.

**2.1. NumPy Python Library**

**NumPy** is a general-purpose array-processing Python library which provides handy methods/functions for working with **n-dimensional arrays**. NumPy is short for "**Numerical Python**". It provides various computing tools such as comprehensive mathematical functions and linear algebra routines. NumPy provides both the **flexibility of Python** and the speed of **well-optimized** compiled C code.

**2.2. Why NumPy?**

**NumPy** is a general-purpose array-processing package. It provides a high-performance multidimensional array object and tools for working with these arrays. It is the fundamental package for scientific computing with Python. Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data.

**2.3. Arrays in NumPy**

An array in NumPy is a table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers. In NumPy, the number of dimensions of the array is called the **rank of the array**. A tuple of integers giving the size of the array along each dimension is known as the **shape of the array**. The array class in NumPy is called **ndarray**. Elements in NumPy arrays are accessed by using square brackets, and arrays can be initialized by using nested Python lists.

**Creating a NumPy Array**

Arrays in NumPy can be created in multiple ways and with various ranks. Arrays can also be created from various data types such as lists and tuples. The type of the resulting array is deduced from the type of the elements in the sequences.

**Note:** The type of the array can be explicitly defined while creating the array.

***Exercise 1: Python program for Creation of Arrays***

    # Python program for
    # Creation of Arrays
    import numpy as np

    # Creating a rank 1 Array
    arr = np.array([1, 2, 3])
    print("Array with Rank 1: \n", arr)

    # Creating a rank 2 Array
    arr = np.array([[1, 2, 3], [4, 5, 6]])
    print("Array with Rank 2: \n", arr)

    # Creating an array from tuple
    arr = np.array((1, 3, 2))
    print("\nArray created using "
          "passed tuple:\n", arr)

**Output**

    Array with Rank 1:
     [1 2 3]
    Array with Rank 2:
     [[1 2 3]
     [4 5 6]]

    Array created using passed tuple:
     [1 3 2]

**Accessing the array Index**

In a NumPy array, indexing (accessing elements by their index) can be done in multiple ways. To print a range of an array, **slicing** is used. **A slice selects a range of elements from the original array.** Because a sliced array is a view onto the elements of the original array, modifying content through the slice modifies the original array.
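The last point is easy to miss, so here is a minimal sketch of it. The array values and variable names are made up for illustration; the only NumPy behaviour relied on is that basic slicing returns a view while .copy() returns an independent array.

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])

# Basic slicing returns a view onto the same data, not a copy.
first_row = arr[0, :]
first_row[0] = 99            # write through the slice

print(arr)                   # the original array now holds 99 at position [0, 0]

# An explicit copy is independent of the original array.
safe = arr[1, :].copy()
safe[0] = -1
print(arr[1, 0])             # the original value is unchanged
```

When the original data must stay untouched, taking a copy of the slice before modifying it is the usual precaution.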
***Exercise 2: Python program to demonstrate indexing in numpy array***

    # Python program to demonstrate
    # indexing in numpy array
    import numpy as np

    # Initial Array
    arr = np.array([[-1, 2, 0, 4],
                    [4, -0.5, 6, 0],
                    [2.6, 0, 7, 8],
                    [3, -7, 4, 2.0]])
    print("Initial Array: ")
    print(arr)

    # Printing a range of Array
    # with the use of slicing method
    sliced_arr = arr[:2, ::2]
    print("Array with first 2 rows and"
          " alternate columns(0 and 2):\n", sliced_arr)

    # Printing elements at
    # specific Indices
    Index_arr = arr[[1, 1, 0, 3], [3, 2, 1, 0]]
    print("\nElements at indices (1, 3), "
          "(1, 2), (0, 1), (3, 0):\n", Index_arr)

**Output**

    Initial Array:
    [[-1.   2.   0.   4. ]
     [ 4.  -0.5  6.   0. ]
     [ 2.6  0.   7.   8. ]
     [ 3.  -7.   4.   2. ]]
    Array with first 2 rows and alternate columns(0 and 2):
     [[-1.  0.]
     [ 4.  6.]]

    Elements a...

**2.4. Basic Array Operations**

In NumPy, arrays allow a wide range of operations which can be performed on a single array or on a combination of arrays. These operations include basic mathematical operations as well as unary and binary operations.

***Exercise 3: Python program to demonstrate basic operations on single array***

    # Python program to demonstrate
    # basic operations on single array
    import numpy as np

    # Defining Array 1
    a = np.array([[1, 2],
                  [3, 4]])

    # Defining Array 2
    b = np.array([[4, 3],
                  [2, 1]])

    # Adding 1 to every element
    print("Adding 1 to every element:", a + 1)

    # Subtracting 2 from each element
    print("\nSubtracting 2 from each element:", b - 2)

    # sum of array elements
    # Performing Unary operations
    print("\nSum of all array "
          "elements: ", a.sum())

    # Adding two arrays
    # Performing Binary operations
    print("\nArray sum:\n", a + b)

**Output**

    Adding 1 to every element: [[2 3]
     [4 5]]

    Subtracting 2 from each element: [[ 2  1]
     [ 0 -1]]

    Sum of all array elements:  10

    Array sum:
     [[5 5]
     [5 5]]

**Constructing a Datatype Object**

In NumPy, the datatype of an array need not be defined unless a specific datatype is required. NumPy tries to guess the datatype for arrays whose datatype is not given in the constructor function.

***Exercise 4: Python Program to create a data type object***

    # Python Program to create
    # a data type object
    import numpy as np

    # Integer datatype
    # guessed by Numpy
    x = np.array([1, 2])
    print("Integer Datatype: ")
    print(x.dtype)

    # Float datatype
    # guessed by Numpy
    x = np.array([1.0, 2.0])
    print("\nFloat Datatype: ")
    print(x.dtype)

    # Forced Datatype
    x = np.array([1, 2], dtype=np.int64)
    print("\nForcing a Datatype: ")
    print(x.dtype)

**Output**

    Integer Datatype:
    int64

    Float Datatype:
    float64

    Forcing a Datatype:
    int64

**Math Operations on DataType array**

In NumPy arrays, basic mathematical operations are performed element-wise on the array. These operations are available both as operator overloads and as functions. NumPy provides many useful functions for performing computations on arrays, such as **sum** for the addition of array elements and **T** for the transpose of an array.

***Exercise 5: Python program to demonstrate math operations on arrays***

    # Python program to demonstrate
    # math operations on arrays
    import numpy as np

    # First Array
    arr1 = np.array([[4, 7], [2, 6]],
                    dtype=np.float64)

    # Second Array
    arr2 = np.array([[3, 6], [2, 8]],
                    dtype=np.float64)

    # Addition of two Arrays
    Sum = np.add(arr1, arr2)
    print("Addition of Two Arrays: ")
    print(Sum)

    # Addition of all Array elements
    # using predefined sum method
    Sum1 = np.sum(arr1)
    print("\nAddition of Array elements: ")
    print(Sum1)

    # Square root of Array
    Sqrt = np.sqrt(arr1)
    print("\nSquare root of Array1 elements: ")
    print(Sqrt)

    # Transpose of Array
    # using the built-in attribute 'T'
    Trans_arr = arr1.T
    print("\nTranspose of Array: ")
    print(Trans_arr)

**Output**

    Addition of Two Arrays:
    [[ 7. 13.]
     [ 4. 14.]]

    Addition of Array elements:
    19.0

    Square root of Array1 elements:
    [[2.         2.64575131]
     [1.41421356 2.44948974]]

    Transpose of Array:
    [[4. 2.]
     ...

***Exercise 6: Creating an array using Python list for One Dimensional Array***

    # importing numpy module
    import numpy as np

    # creating list (named py_list so the built-in list type is not shadowed)
    py_list = [1, 2, 3, 4]

    # creating numpy array
    sample_array = np.array(py_list)
    print("List in python : ", py_list)
    print("Numpy Array in python :",
          sample_array)

**Output:**

    List in python :  [1, 2, 3, 4]
    Numpy Array in python : [1 2 3 4]

***Exercise 7: Creating Array using Python lists for two dimensional Arrays***

    # importing numpy module
    import numpy as np

    # creating list
    list_1 = [1, 2, 3, 4]
    list_2 = [5, 6, 7, 8]
    list_3 = [9, 10, 11, 12]

    # creating numpy array
    sample_array = np.array([list_1,
                             list_2,
                             list_3])
    print("Numpy multi dimensional array in python\n",
          sample_array)

**Output:**

    Numpy multi dimensional array in python
     [[ 1  2  3  4]
     [ 5  6  7  8]
     [ 9 10 11 12]]

**Note:** use **[ ]** operators inside numpy.array() for multi-dimensional arrays.

**2.5. Anatomy of an Array**

The axis of an array describes the order of the indexing into the array.

*Axis 0 = one dimensional*

*Axis 1 = Two dimensional*

*Axis 2 = Three dimensional*

**The Attribute Shape**

The NumPy attribute **shape** gives the number of elements along each axis. It is returned as a tuple.

***Exercise 8: Using the attribute shape***

    # importing numpy module
    import numpy as np

    # creating list
    list_1 = [1, 2, 3, 4]
    list_2 = [5, 6, 7, 8]
    list_3 = [9, 10, 11, 12]

    # creating numpy array
    sample_array = np.array([list_1, list_2, list_3])
    print("Numpy array :")
    print(sample_array)

    # print shape of the array
    print("Shape of the array :",
          sample_array.shape)

**Output:**

    Numpy array :
    [[ 1  2  3  4]
     [ 5  6  7  8]
     [ 9 10 11 12]]
    Shape of the array : (3, 4)

***Exercise 9: Using the attribute shape for a 2-dimensional array***

    import numpy as np

    sample_array = np.array([[0, 4, 2], [3, 4, 5], [23, 4, 5], [2, 34, 5], [5, 6, 7]])
    print("shape of the array :",
          sample_array.shape)

**Output:**

    shape of the array : (5, 3)

**Lab (3):** **CSV Files**

**Lab (3) Description:**

This lab describes how to read from and write to a CSV file.

**Lab (3): Learning Outcomes**

By the end of this lab the student must be able to:

- Use Python to read from and write to a CSV file.
- Write code that reads from and writes to a CSV file.

**Lab (3): What Instructor has to do**

The instructor has to:

- Follow the steps below to demonstrate to the students how to use Python to read from and write to a CSV file.
- Supervise students while they write and run the Python code in this lab.

**3.1. What is a CSV?**

Understanding how to read CSV files in Python is essential for any data scientist. CSV, which stands for "Comma Separated Values," is a fundamental format for storing tabular data as plain text. As data scientists, we frequently encounter CSV data in our daily workflows. Therefore, the ability to read CSV files in Python is crucial for efficiently handling and analyzing [data sets](https://www.analyticsvidhya.com/bbplus).

**3.2. Structure of CSV in Python**

We have a file named **Salary\_Data.csv**. The first line of a CSV file is the header.
It contains the names of the fields/features, which are shown on top as the column names in the file. After the header, each line of the file is an observation/a record. The values of a record are separated by commas.

**3.3. Read csv file in Python**

There are two main ways to read CSV files in Python:

- **Using the csv module:** This is the built-in module for working with CSV files in Python. It provides basic functionality for reading and writing CSV data. Here's an example of how to read a CSV file using csv.reader:

***Exercise 1: Read a CSV file using csv.reader***

    import csv

    # Open the CSV file in read mode
    with open('data.csv', 'r') as csvfile:
        # Create a reader object
        csv_reader = csv.reader(csvfile)
        # Iterate through the rows in the CSV file
        for row in csv_reader:
            # Each row is a list of the values on that line
            print(row)

- **Using the Pandas library:** Pandas is a powerful library for data analysis in Python. It offers a more convenient way to read and manipulate CSV data. Here's an example of how to read a CSV file using Pandas:

<!-- -->

    import pandas as pd

    # Read the CSV file into a DataFrame
    df = pd.read_csv('data.csv')

    # Access data in the DataFrame using column names or indexing
    print(df['column_name'])
    print(df.iloc[0])  # Access first row

The methods to read a CSV file in Python are:

- Read CSV file using csv.reader
- Read CSV file using the readlines() function
- Read CSV file using Pandas
- Read CSV file using csv.DictReader

**3.4 How to Read CSV Files in Python with Procedural Steps?**

There are many different ways to read data in a CSV file, which we will now see one by one. You can read CSV files using the **csv.reader** object from Python's **csv** module. Steps to read a CSV file using csv.reader:

1.  **Import the csv library**

        import csv

2.  **Open the CSV file**

    The **open()** function in Python is used to open files and return a file object.

        file = open('Salary_Data.csv')

    The type of file is **\_io.TextIOWrapper**, which is the file object returned by **open()**.

3.  **Use the csv.reader object to read the CSV file**

        csvreader = csv.reader(file)

4.  **Extract the field names**

    Use the next() function to obtain the header. next() returns the current row and moves to the next row. The first time you call next(), it returns the header; the next time, it returns the first record, and so on.

        header = next(csvreader)

    The header is:

    ![Field names in CSV header \[python read csv\]](media/image8.jpeg)

5.  **Extract the rows/records**

    Create an empty list called rows, iterate through the csvreader object, and append each row to the rows list.

        rows = []
        for row in csvreader:
            rows.append(row)

6.  **Close the file**

    The **close()** method is used to close the opened file. Once it is closed, we cannot perform any operations on it.

        file.close()

***Exercise 2: Complete Code for Read CSV Python***

    import csv

    file = open("Salary_Data.csv")
    csvreader = csv.reader(file)
    header = next(csvreader)
    print(header)
    rows = []
    for row in csvreader:
        rows.append(row)
    print(rows)
    file.close()

Naturally, we might forget to close an open file. To avoid that, we can use the **with** statement to release the resources automatically. In simple terms, there is no need to call the **close()** method if we are using the **with** statement.
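The automatic clean-up can be observed directly through the file object's closed attribute. The sketch below is self-contained: it first writes a small throwaway CSV to a temporary directory, and the file name, column names, and values are assumptions made up for the illustration (they are not the contents of Salary\_Data.csv).

```python
import csv
import os
import tempfile

# Create a small throwaway CSV so the sketch runs on its own.
# The column names and values are made up for illustration.
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "w", newline="") as f:
    csv.writer(f).writerows([["YearsExperience", "Salary"], ["1.1", "39343"]])

with open(path, "r", newline="") as file:
    csvreader = csv.reader(file)
    header = next(csvreader)     # first line: the field names
    rows = list(csvreader)       # remaining lines: the records
    print(file.closed)           # False: still inside the with block

print(file.closed)               # True: closed automatically on leaving the block
print(header, rows)
```

The file is closed even if an exception is raised inside the block, which is the main reason the with form is preferred over calling close() by hand.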
***Exercise 3: Implementing Code Using the with Statement***

**Basic Syntax:** with open(filename, mode) as alias\_filename:

**Modes:**

- 'r' -- to read an existing file,
- 'w' -- to create a new file if the given file doesn't exist and write to it (an existing file is overwritten),
- 'a' -- to append to existing file content,
- '+' -- added to another mode (for example 'r+' or 'w+') to open the file for both reading and writing.

<!-- -->

    import csv

    rows = []
    with open("Salary_Data.csv", 'r') as file:
        csvreader = csv.reader(file)
        header = next(csvreader)
        for row in csvreader:
            rows.append(row)
    print(header)
    print(rows)

**Output**

![](media/image10.png)

***Exercise 4: Python code to read and print part of a csv file***

    # importing csv module
    import csv

    # csv file name
    filename = "aapl.csv"

    # initializing the titles and rows list
    fields = []
    rows = []

    # reading csv file
    with open(filename, 'r') as csvfile:
        # creating a csv reader object
        csvreader = csv.reader(csvfile)

        # extracting field names through first row
        fields = next(csvreader)

        # extracting each data row one by one
        for row in csvreader:
            rows.append(row)

        # get total number of rows
        print("Total no. of rows: %d" % (csvreader.line_num))

    # printing the field names
    print('Field names are:' + ', '.join(field for field in fields))

    # printing first 5 rows
    print('\nFirst 5 rows are:\n')
    for row in rows[:5]:
        # parsing each column of a row
        for col in row:
            print("%10s" % col, end=" ")
        print('\n')

**Output:** https://media.geeksforgeeks.org/wp-content/uploads/20230315235331/Screenshot-2023-03-15-235320.png

The above example uses a CSV file named aapl.csv. Let us try to understand the above code.

    with open(filename, 'r') as csvfile:
        csvreader = csv.reader(csvfile)

- Here, we first open the CSV file in READ mode.
The file object is named **csvfile**. The file object is converted to a csv.reader object, which we save as **csvreader**.

***fields = next(csvreader)***

- **csvreader** is an iterator. Hence, next(csvreader) returns the current row and advances the iterator to the next row. Since the first row of our CSV file contains the headers (or field names), we save them in a list called **fields**.

<!-- -->

    for row in csvreader:
        rows.append(row)

- Now, we iterate through the remaining rows using a for loop. Each row is appended to a list called **rows**. If you print a row, you will find that it is simply a list containing all the field values.

***print("Total no. of rows: %d" % (csvreader.line\_num))***

- **csvreader.line\_num** is a counter that returns the number of lines read from the file so far (this count includes the header line).

**3.5 Reading CSV Files into a Dictionary With csv**

We can read a CSV file into a dictionary using the csv module in Python and the csv.DictReader class. Here's an example. Suppose we have an **employees.csv** file with the following content:

    name,department,birthday_month
    John Smith,HR,July
    Alice Johnson,IT,October
    Bob Williams,Finance,January

In this example, **csv.DictReader reads each row of the CSV file as a dictionary** where the keys are the column headers and the values are the corresponding values in each row. The dictionaries are then appended to a list (data\_list in this example).

***Exercise 5: Read a CSV file into a dictionary using the csv module in Python and the csv.DictReader class.***

    import csv

    # Open the CSV file for reading
    with open('employees.csv', mode='r') as file:
        # Create a CSV reader with DictReader
        csv_reader = csv.DictReader(file)

        # Initialize an empty list to store the dictionaries
        data_list = []

        # Iterate through each row in the CSV file
        for row in csv_reader:
            # Append each row (as a dictionary) to the list
            data_list.append(row)

    # Print the list of dictionaries
    for data in data_list:
        print(data)

**Output:**

    {'name': 'John Smith', 'department': 'HR', 'birthday_month': 'July'}
    {'name': 'Alice Johnson', 'department': 'IT', 'birthday_month': 'October'}
    {'name': 'Bob Williams', 'department': 'Finance', 'birthday_month': 'January'}

**3.6 Writing to a CSV file**

To write to a CSV file, we first open the CSV file in WRITE mode. The file object is converted to a csv.writer object, and further operations take place. The code and a detailed explanation are given below.

***Exercise 6: Python code to write to a CSV file***

    # importing the csv module
    import csv

    # field names
    fields = ['Name', 'Branch', 'Year', 'CGPA']

    # data rows of csv file
    rows = [['Nikhil', 'COE', '2', '9.0'],
            ['Sanchit', 'COE', '2', '9.1'],
            ['Aditya', 'IT', '2', '9.3'],
            ['Sagar', 'SE', '1', '9.5'],
            ['Prateek', 'MCE', '3', '7.8'],
            ['Sahil', 'EP', '2', '9.1']]

    # name of csv file
    filename = "university_records.csv"

    # writing to csv file
    # (newline='' prevents blank lines between rows on some platforms)
    with open(filename, 'w', newline='') as csvfile:
        # creating a csv writer object
        csvwriter = csv.writer(csvfile)

        # writing the fields
        csvwriter.writerow(fields)

        # writing the data rows
        csvwriter.writerows(rows)

**Let us try to understand the above code in pieces.**

- **fields** and **rows** have already been defined. **fields** is a list containing all the field names. **rows** is a list of lists; each row is a list containing the field values of that row.

<!-- -->

    with open(filename, 'w', newline='') as csvfile:
        csvwriter = csv.writer(csvfile)

- Here, we first open the CSV file in WRITE mode. The file object is named **csvfile**. The file object is converted to a csv.writer object.
We save the csv.writer object as **csvwriter**.

***csvwriter.writerow(fields)***

- Now we use the **writerow** method to write the first row, which contains the field names.

***csvwriter.writerows(rows)***

- We use the **writerows** method to write multiple rows at once.

**3.7 Writing a dictionary to a CSV file**

To write a dictionary to a CSV file, the file object (csvfile) is converted to a DictWriter object. A detailed example with explanation and code is given below.

***Exercise 7: Writing a dictionary to CSV file.***

    # importing the csv module
    import csv

    # my data rows as dictionary objects
    mydict = [{'branch': 'COE', 'cgpa': '9.0',
               'name': 'Nikhil', 'year': '2'},
              {'branch': 'COE', 'cgpa': '9.1',
               'name': 'Sanchit', 'year': '2'},
              {'branch': 'IT', 'cgpa': '9.3',
               'name': 'Aditya', 'year': '2'},
              {'branch': 'SE', 'cgpa': '9.5',
               'name': 'Sagar', 'year': '1'},
              {'branch': 'MCE', 'cgpa': '7.8',
               'name': 'Prateek', 'year': '3'},
              {'branch': 'EP', 'cgpa': '9.1',
               'name': 'Sahil', 'year': '2'}]

    # field names
    fields = ['name', 'branch', 'year', 'cgpa']

    # name of csv file
    filename = "university_records.csv"

    # writing to csv file
    # (newline='' prevents blank lines between rows on some platforms)
    with open(filename, 'w', newline='') as csvfile:
        # creating a csv dict writer object
        writer = csv.DictWriter(csvfile, fieldnames=fields)

        # writing headers (field names)
        writer.writeheader()

        # writing data rows
        writer.writerows(mydict)

In this example, we write a list of dictionaries, **mydict**, to a CSV file.

    with open(filename, 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fields)

- Here, the file object (**csvfile**) is converted to a DictWriter object, and we specify the **fieldnames** as an argument.
***writer.writeheader()***

- The **writeheader** method writes the first row of the CSV file using the pre-specified fieldnames.

***writer.writerows(mydict)***

- The **writerows** method writes all the rows; in each row, it writes only the values (not the keys), in the order given by fieldnames.
- So, in the end, our CSV file looks like this:

![csv file](media/image12.png)

*csv file*

Consider that a CSV file looks like this in plain text:

*university record*

- We notice that the delimiter is not a comma but a semicolon. Also, the rows are separated by two newlines instead of one. In such cases, we can specify the delimiter and line terminator.

**Lab (4):** **Pandas Package**

**Lab (4) Description:** This lab aims to provide basic knowledge of the Python Pandas package.

**Lab (4): Learning Outcomes**

By the end of this lab the student must be able to:

- Use the Pandas package to manipulate and explore data sets.
- Write Python code that uses the Pandas package.

**Lab (4): What Instructor has to do**

The instructor has to:

- Follow the steps below to demonstrate to the students how to use Python Pandas.
- Supervise students while they write and run the Python Pandas code in this lab.

**4.1 What is Pandas?**

Pandas is arguably the most important Python package for data analysis. With over 100 million downloads per month, it is the de facto standard package for data manipulation and exploratory data analysis. Its ability to read from and write to an extensive list of formats makes it a versatile tool for data science practitioners. Its data manipulation functions make it a highly accessible and practical tool for aggregating, analyzing, and cleaning data.

**Pandas is a data manipulation package in Python for tabular data**, that is, data in the form of rows and columns, also known as DataFrames. Intuitively, you can think of a DataFrame as an Excel sheet.
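To make the spreadsheet analogy concrete, here is a minimal sketch that builds a DataFrame by hand. The variable names `records` and `table` and the toy values are assumptions chosen for illustration; they are not part of the lab's dataset.

```python
import pandas as pd

# Hypothetical toy records, used only to illustrate rows and columns
records = {
    "name": ["Nikhil", "Aditya", "Sagar"],
    "branch": ["COE", "IT", "SE"],
    "cgpa": [9.0, 9.3, 9.5],
}

# Each dictionary key becomes a column; each list position becomes a row
table = pd.DataFrame(records)

print(table)                 # tabular view, like a small spreadsheet
print(table.shape)           # (number of rows, number of columns)
print(list(table.columns))   # column names
```

Each key of the dictionary plays the role of a spreadsheet column header, and the automatically generated row labels (0, 1, 2) play the role of row numbers.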
**pandas' functionality includes data transformations, like [sorting rows](https://www.datacamp.com/tutorial/pandas-sort-values) and taking subsets, as well as calculating summary statistics such as the mean, reshaping DataFrames, and joining DataFrames together**. pandas works well with other popular Python data science packages, often called the PyData ecosystem, including

- [**NumPy**](https://www.datacamp.com/tutorial/python-numpy-tutorial) for numerical computing
- [**Matplotlib**](https://www.datacamp.com/tutorial/matplotlib-tutorial-python), [**Seaborn**](https://www.datacamp.com/tutorial/seaborn-python-tutorial), [**Plotly**](https://www.datacamp.com/courses/introduction-to-data-visualization-with-plotly-in-python), and other data visualization packages
- [**scikit-learn**](https://www.datacamp.com/tutorial/machine-learning-python) for machine learning

**4.2 What is pandas used for?**

pandas is used throughout the data analysis workflow. With pandas, you can:

- Import datasets from databases, spreadsheets, comma-separated values (CSV) files, and more.
- Clean datasets, for example, by dealing with missing values.
- Tidy datasets by reshaping their structure into a suitable format for analysis.
- Aggregate data by calculating summary statistics such as the mean of columns, correlation between them, and more.
- Visualize datasets and uncover insights.

pandas also contains functionality for time series analysis and analyzing text data.

**4.3 Importing data in pandas**

To begin working with pandas, import the pandas Python package as shown below. The most common alias for pandas is pd.
***import pandas as pd***

All exercises will be done using the diabetes dataset, which you should download from

***Exercise 1: Importing CSV files***

Use read\_csv() with the path to the CSV file to read a comma-separated values file.

***df = pd.read\_csv("diabetes.csv")***

**Explanation:** This code reads a CSV file named "diabetes.csv" and stores its contents in a pandas DataFrame object named "df". The pandas function ***read\_csv()*** reads a CSV file and creates a DataFrame from it; the path of the CSV file is passed as an argument. Once the file is read, the DataFrame "df" can be used for further analysis and manipulation.

***Exercise 2: Importing text files***

Reading text files is similar to reading CSV files. The only nuance is that you need to specify a separator with the sep argument, as shown below. The separator is the symbol that separates the fields (columns) within each row of the file. Comma (sep = ","), whitespace (sep = "\\s+"), tab (sep = "\\t"), and colon (sep = ":") are commonly used separators. Here \\s matches a single whitespace character and \\s+ matches one or more of them; writing the pattern as a raw string (r"...") keeps Python from treating the backslash as an escape.

***df = pd.read\_csv("diabetes.txt", sep=r"\\s+")***

***Exercise 3: Importing Excel files (single sheet)***

Reading Excel files (both XLS and XLSX) is as easy as calling the read\_excel() function with the file path as input.

***df = pd.read\_excel('diabetes.xlsx')***

***Exercise 4: Importing Excel files (multiple sheets)***

Reading Excel files with multiple sheets is not that different. You just need to specify one additional argument, sheet\_name, where you can pass either a string for the sheet name or an integer for the sheet position (note that Python uses 0-indexing, so the first sheet is accessed with sheet\_name = 0).
\# Extracting the second sheet since Python uses 0-indexing

***df = pd.read\_excel('diabetes\_multi.xlsx', sheet\_name=1)***

**4.4 Outputting data in pandas**

Just as pandas can import data from various file types, it also allows you to export data into various formats. This is especially useful when data has been transformed using pandas and needs to be saved locally on your machine. Below is how to output pandas DataFrames into various formats.

***Exercise 5: Outputting a DataFrame into a CSV file***

A pandas DataFrame (here we are using df) is saved as a CSV file using the .to\_csv() method. The arguments include the filename with path and index, where index=True writes the DataFrame's index to the file and index=False omits it.

***df.to\_csv("diabetes\_out.csv", index=False)***

***Exercise 6: Outputting a DataFrame into a text file***

As with writing DataFrames to CSV files, you can call .to\_csv(). The only differences are that the output file has a .txt extension and that you need to specify a separator using the sep argument.

***df.to\_csv('diabetes\_out.txt', header=df.columns, index=None, sep=' ')***

***Exercise 7: Outputting a DataFrame into an Excel file***

Call .to\_excel() on the DataFrame object to save it as an ".xlsx" file. Whether the legacy ".xls" format can still be written depends on your pandas version; recent versions no longer support it.

***df.to\_excel("diabetes\_out.xlsx", index=False)***

***Exercise 8: How to view data using .head()***

You can view the first few or last few rows of a DataFrame using ***the .head() or .tail() methods***, respectively. You can specify the number of rows through the n argument (the default value is 5).

***df.head()***

![First five rows of the DataFrame (df) using .head()](media/image14.png)

*First five rows of the DataFrame*

**Explanation:** This code calls the head() method on a pandas DataFrame object named df. The head() method displays the first few rows of the DataFrame. By default, it displays the first 5 rows, but you can pass an integer argument to display a different number of rows.
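The effect of the n argument can be sketched on a small stand-in frame. The DataFrame `demo` and its column names are hypothetical, invented here so the example runs without the diabetes file.

```python
import pandas as pd

# Hypothetical ten-row frame standing in for the lab's df
demo = pd.DataFrame({"id": range(10), "square": [i * i for i in range(10)]})

first_default = demo.head()   # no argument: the first 5 rows
first_three = demo.head(3)    # explicit n: the first 3 rows

print(first_default)
print(first_three)
```

Calling head() on a frame with fewer rows than n simply returns all available rows.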
***Exercise 9: How to view data using .tail()***

***df.tail(n = 10)***

*Last 10 rows of the DataFrame with .tail()*

**Explanation:** This code uses the tail() method to display the last 10 rows of the DataFrame df. The n parameter is set to 10 to specify the number of rows to display. The tail() method is commonly used to quickly check the last few rows of a DataFrame, either to confirm that the data has been loaded correctly or to get a quick overview of the data.

***Exercise 10: Understanding data using .describe()***

The .describe() method prints the summary statistics of all numeric columns: count, mean, standard deviation, minimum, maximum, and quartiles.

***df.describe()***

![Get summary statistics with .describe()](media/image16.png)

*Get summary statistics with .describe()*

It gives a quick look at the scale, skew, and range of numeric data. You can also change which percentiles are reported using the percentiles argument. Here, for example, we're looking at the 30%, 50%, and 70% percentiles of the numeric columns in DataFrame df.

***df.describe(percentiles=\[0.3, 0.5, 0.7\])***

*Get summary statistics with specific percentiles*

You can also isolate specific data types in your summary output by using the include argument. Here, for example, we're only summarizing the columns with the integer data type.

***df.describe(include=\[int\])***

![summary statistics of integer columns only pandas](media/image18.png)

*Get summary statistics of integer columns only*

**Explanation:** This code uses the describe() method of a pandas DataFrame object to generate descriptive statistics of the data in the DataFrame. The **include** parameter restricts the summary to the listed data types, so **include=\[int\]** generates statistics for integer columns only.

Similarly, you might want to exclude certain data types using the **exclude** argument. With **exclude=\[int\]**, integer columns are left out and describe() generates statistics only for the non-integer columns in the DataFrame.

***df.describe(exclude=\[int\])***

*Get summary statistics of non-integer columns only*

Often, practitioners find it easier to view such statistics by transposing them with the .T attribute.

***df.describe().T***

![Transpose summary statistics pandas](media/image20.png)

*Transpose summary statistics with .T*

**Explanation:** This code uses the ***describe() method*** to generate summary statistics of a pandas DataFrame df. The ***T attribute*** is then used to transpose the resulting summary table, so that the rows become columns and vice versa. For example, if df has columns for "age", "income", and "education", the table returned by describe() has rows for "count", "mean", "std", "min", "25%", "50%", "75%", and "max", and columns for "age", "income", and "education"; after transposing, each of "age", "income", and "education" becomes a row and each statistic becomes a column. With many columns this is easier to read and compare. Overall, this code is useful for quickly getting an overview of the distribution and range of values in a DataFrame.

**4.5 Understanding Data using Pandas**

The following exercises explore pandas data analysis tools used to understand data.

***Exercise 11: Understanding data using .info()***

**The .info() method is a quick way to look at the data types, missing values, and data size of a DataFrame**. In the following exercise we set the parameter ***show\_counts*** to True, which shows the number of non-missing values in each column. We also set ***memory\_usage*** to True, which shows the total memory usage of the DataFrame elements. When ***verbose*** is set to True, .info() prints the full summary.
***df.info(show\_counts=True, memory\_usage=True, verbose=True)***

***Exercise 12: Understanding your data using .shape***

**The number of rows and columns of a DataFrame can be identified using the .shape attribute of the DataFrame**. It returns a tuple (rows, columns) that can be indexed to get only the row count or only the column count.

***df.shape \# Get the number of rows and columns***

***df.shape\[0\] \# Get the number of rows only***

***df.shape\[1\] \# Get the number of columns only***

    (768, 9)
    768
    9

***Exercise 13: Get all columns and column names***

**Calling the .columns attribute of a DataFrame object returns the column names in the form of an Index object**. As a reminder, a pandas index is the address/label of the row or column.

***df.columns***

![Output of columns](media/image22.png)

It can be converted to a list using the list() function.

***list(df.columns)***

*Column names as a list*

**4.6 Checking for missing values in pandas with .isnull()**

**The sample DataFrame does not have any missing values**. To have a dataset with missing data to work with, let's introduce a few. ***The .copy() method*** makes a copy of the original DataFrame, so that changes to the copy are not reflected in the original. Using .loc (to be discussed later), you can set the rows labelled 2 to 5 of the Pregnancies column to NaN, which denotes a missing value. Because .loc slices include both endpoints, this affects four rows.

***Exercise 14: Checking missing data***

***df2 = df.copy()***

***df2.loc\[2:5,'Pregnancies'\] = None***

***df2.head(7)***

![Rows 2 to 5 are missing pandas](media/image24.png)

*You can see that rows 2 to 5 are now NaN*

You can check whether each element in a DataFrame is missing using ***the .isnull() method***.

***df2.isnull().head(7)***

It is often more useful to know how much missing data you have. You can combine .isnull() ***with .sum()*** to count the number of nulls in each column.
***df2.isnull().sum()***

    Pregnancies                 4
    Glucose                     0
    BloodPressure               0
    SkinThickness               0
    Insulin                     0
    BMI                         0
    DiabetesPedigreeFunction    0
    Age                         0
    Outcome                     0
    dtype: int64

You can also do a double sum **to get the total number of nulls in the DataFrame**.

***df2.isnull().sum().sum()***

**4.7 Slicing and Extracting Data in pandas**

The pandas package offers several ways to subset, filter, and isolate data in your DataFrames. Here, we'll see the most common ways.

***Exercise 15: Isolating one column using \[ \]***

**You can isolate a single column using square brackets \[ \] with a column name inside.** **The output is a pandas Series object. A pandas Series is a one-dimensional array containing data of any type, including integer**, float, string, boolean, Python objects, etc. A **DataFrame is comprised of many Series that act as columns**.

***df\['Outcome'\]***

*Isolating one column in pandas*

***Exercise 16: Isolating two or more columns using \[\[ \]\]***

You can also provide a list of column names inside the square brackets to fetch more than one column. Here, square brackets are used in two different ways. We use the **outer square brackets to indicate a subset of a DataFrame**, and the **inner square brackets to create a list**.

***df\[\['Pregnancies', 'Outcome'\]\]***

![Isolating two columns in pandas](media/image26.png)

*Isolating two columns in pandas*

***Exercise 17: Fetching and Returning one row using \[ \]***

A single row can be fetched by passing in a boolean Series with one True value. In the example below, the second row, with index = 1, is returned. Here, .index returns the row labels of the DataFrame, and the comparison turns them into a one-dimensional Boolean array.

***df\[df.index==1\]***

*Isolating one row in pandas*

***Exercise 18: Fetching two or more rows using \[ \]***

Similarly, two or more rows can be returned using the .isin() method instead of the == operator.
***df\[df.index.isin(range(2,10))\]***

![Isolating specific rows in pandas](media/image28.png)

*Fetching specific rows in pandas*

***Exercise 19: Using .loc\[\] and .iloc\[\] to fetch rows***

You can fetch specific rows by labels or conditions using .loc\[\] and .iloc\[\] ("location" and "integer location"). .loc\[\] uses a label to point to a row, column or cell, whereas .iloc\[\] uses the numeric position. To understand the difference between the two, let's modify the index of df2 created earlier.

***df2.index = range(1,769)***

The examples below return a pandas Series instead of a DataFrame. The 1 in .loc\[\] is the row label, which now belongs to the first row, whereas the 1 in .iloc\[\] is the row position. Positions start at 0, so position 1 is the second row, whose label is 2.

***df2.loc\[1\]***

    Pregnancies                   6.000
    Glucose                     148.000
    BloodPressure                72.000
    SkinThickness                35.000
    Insulin                       0.000
    BMI                          33.600
    DiabetesPedigreeFunction      0.627
    Age                          50.000
    Outcome                       1.000
    Name: 1, dtype: float64

***df2.iloc\[1\]***

    Pregnancies                   1.000
    Glucose                      85.000
    BloodPressure                66.000
    SkinThickness                29.000
    Insulin                       0.000
    BMI                          26.600
    DiabetesPedigreeFunction      0.351
    Age                          31.000
    Outcome                       0.000
    Name: 2, dtype: float64

You can also fetch multiple rows by providing a range in square brackets. Note that a .loc\[\] slice includes the end label (labels 100 to 110, 11 rows), while an .iloc\[\] slice excludes the end position (positions 100 to 109, 10 rows).

***df2.loc\[100:110\]***

*Isolating rows in pandas with .loc\[\]*

***df2.iloc\[100:110\]***

![Isolating rows in pandas with .iloc\[\]](media/image30.png)

*Isolating rows in pandas with .iloc\[\]*

You can also subset with .loc\[\] and .iloc\[\] by using a list instead of a range.

***df2.loc\[\[100, 200, 300\]\]***

*Isolating rows using a list in pandas with .loc\[\]*

***df2.iloc\[\[100, 200, 300\]\]***

![Isolating rows using a list in pandas with .iloc\[\]](media/image32.png)

*Isolating rows using a list in pandas with .iloc\[\]*

You can also select specific columns along with rows. This is where .iloc\[\] differs from .loc\[\]: it requires column positions, not column labels.
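Before the column-selection calls on df2, the label-versus-position distinction can be sketched on a small frame. The DataFrame `toy`, its values, and the variable names are assumptions made up for illustration; only the 1-based index mirrors the df2.index = range(1, 769) step.

```python
import pandas as pd

# Hypothetical four-row frame whose labels start at 1, like df2 after re-indexing
toy = pd.DataFrame(
    {"Glucose": [148, 85, 183, 89], "Age": [50, 31, 32, 21]},
    index=range(1, 5),
)

by_label = toy.loc[1]            # row whose label is 1 (the first row)
by_position = toy.iloc[1]        # row at position 1 (the second row)

label_slice = toy.loc[1:3]       # label slice: end label included, 3 rows
position_slice = toy.iloc[1:3]   # position slice: end excluded, 2 rows

# Rows plus columns: labels with .loc, integer positions with .iloc
cols_by_label = toy.loc[1:2, ["Glucose"]]
cols_by_position = toy.iloc[0:2, :1]

print(by_label["Glucose"], by_position["Glucose"])
print(len(label_slice), len(position_slice))
```

The last two selections return the same two rows and the same single column, reached once through labels and once through positions.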
***df2.loc\[100:110, \['Pregnancies', 'Glucose', 'BloodPressure'\]\]***

*Isolating columns in pandas with .loc\[\]*

***df2.iloc\[100:110, :3\]***

![Isolating columns in pandas with .iloc\[\]](media/image34.png)

*Isolating columns with .iloc\[\]*

**Lab (5): Cleaning of Data Using Pandas**

**Lab (5) Description:** This lab aims to provide basic knowledge of cleaning data using Pandas.

**Lab (5): Learning Outcomes**

By the end of this lab the student must be able to:

- Use the Pandas package for cleaning data.
- Write code that uses pandas cleaning tools.

**Lab (5): What Instructor has to do**

The instructor has to:

- Follow the steps below to demonstrate to the students how to use Python for cleaning data using pandas.
- Supervise students while they write and run the Python code in this lab.

**5.1 Cleaning data using pandas**

Data cleaning is one of the most common tasks in data science. pandas lets you **preprocess data for any use, including but not limited to training machine learning and deep learning models**. Let's use the DataFrame df2 from earlier, which has four missing values, to illustrate a few data cleaning use cases. As a reminder, here's how you can see how many missing values are in a DataFrame.

***df2.isnull().sum()***

    Pregnancies                 4
    Glucose                     0
    BloodPressure               0
    SkinThickness               0
    Insulin                     0
    BMI                         0
    DiabetesPedigreeFunction    0
    Age                         0
    Outcome                     0
    dtype: int64

***Exercise 1: Dealing with missing data technique \#1: Dropping missing values***

One way to deal with missing data is to drop it. This is particularly useful in cases where you have plenty of data and losing a small portion won't impact the downstream analysis. You can use the .dropna() method as shown below. Here, we are saving the results from .dropna() into a DataFrame df3.
***df3 = df2.copy()***

***df3 = df3.dropna()***

***df3.shape***

(764, 9) \# this is 4 rows fewer than df2

The axis argument lets you specify whether you are dropping rows or [**columns**](https://www.datacamp.com/tutorial/pandas-drop-column) with missing values. The default axis removes the rows containing NaNs. Use axis = 1 to remove the columns with one or more NaN values. Also, notice how we are using the argument inplace=True, which lets you skip saving the output of .dropna() into a new DataFrame.

***df3 = df2.copy()***

***df3.dropna(inplace=True, axis=1)***

***df3.head()***

*Dropping missing data in pandas*

***Exercise 2: Dropping missing data in pandas***

By default (how='any'), a row is dropped if it contains at least one missing value. Setting the how argument to 'all' drops only the rows (or, with axis = 1, the columns) in which every value is missing.

***df3 = df2.copy()***

***df3.dropna(inplace=True, how='all')***

***Exercise 3: Dealing with missing data technique \#2: Replacing missing values***

**Instead of dropping, replacing missing values with a summary statistic or a specific value (depending on the use case) may be the best way to go**. For example, if one value is missing from a column of daily temperatures for a week, replacing it with the average temperature of that week may be more effective than dropping the row completely. You can replace the missing data with the column mean using the code below. The fill is applied to the Pregnancies column only, so that the Pregnancies mean cannot be written into any other column.

***df3 = df2.copy()***

***\# Get the mean of Pregnancies***

***mean\_value = df3\['Pregnancies'\].mean()***

***\# Fill missing values in Pregnancies using .fillna()***

***df3\['Pregnancies'\] = df3\['Pregnancies'\].fillna(mean\_value)***

***Exercise 4: Dealing with Duplicate Data***

Let's add some duplicates to the original data to learn how to eliminate duplicates in a DataFrame. Here, ***we are using the pd.concat() function to concatenate*** the rows of the df2 DataFrame to the df2 DataFrame, adding perfect duplicates of every row in df2.
***df3 = pd.concat(\[df2, df2\])***

***df3.shape***

(1536, 9)

You can remove all duplicate rows (the default) from the DataFrame [**using the .drop\_duplicates() method**](https://www.datacamp.com/tutorial/pandas-drop-duplicates).

***df3 = df3.drop\_duplicates()***

***df3.shape***

(768, 9)

***Exercise 5: Renaming columns***

A common data cleaning task is renaming columns. **With the .rename() method, you can use columns as an argument to rename specific columns**. The code below passes a dictionary that maps old column names to new ones.

***df3.rename(columns = {'DiabetesPedigreeFunction':'DPF'}, inplace = True)***

***df3.head()***

![Renaming columns in pandas](media/image36.png)

***Exercise 6: Renaming columns in pandas***

You can also directly assign column names as a list to the DataFrame. The list is applied by position and must contain exactly one name per column, so check the order carefully: in the example below the first column (Pregnancies) ends up labelled Glucose, and every other column is shifted in the same way.

***df3.columns = \['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DPF', 'Age', 'Outcome', 'STF'\]***

***df3.head()***

*Renaming columns in pandas*

**5.2 Data analysis in pandas**

The main value proposition of pandas lies in its quick data analysis functionality. In this section, we'll focus on a set of analysis techniques you can use in pandas.

***Exercise 7: Summary operators (mean, mode, median)***

As you saw earlier, you can get the mean of each column using the .mean() method.

***df.mean()***

![Printing the mean of columns in pandas](media/image38.png)

*Printing the mean of columns in pandas*

The mode can be computed similarly using the .mode() method.

***df.mode()***

*Printing the mode of columns in pandas*

Similarly, the median of each column is computed with the .median() method.

***df.median()***

![Printing the median of columns in pandas](media/image40.png)

*Printing the median of columns in pandas*

***Exercise 8: Create new columns based on existing columns***

pandas provides fast and efficient computation by letting you combine two or more columns as if they were scalar variables.
The code below divides each value in the Glucose column by the corresponding value in the Insulin column to compute a new column named Glucose\_Insulin\_Ratio. Rows where Insulin is 0 do not raise an error; pandas stores inf for them (or NaN when Glucose is also 0).

***df2\['Glucose\_Insulin\_Ratio'\] = df2\['Glucose'\]/df2\['Insulin'\]***

***df2.head()***

*Create a new column from existing columns in pandas*

***Exercise 9: Counting using .value\_counts()***

You will often work with categorical values and want to count the number of observations in each category of a column. Category values can be counted using the .value\_counts() method. Here, for example, we are counting the number of observations where Outcome is diabetic (1) and the number of observations where Outcome is non-diabetic (0).

***df\['Outcome'\].value\_counts()***

![Using .value\_counts() in pandas](media/image42.png)

*Using .value\_counts() in pandas*

Adding the normalize argument returns proportions instead of absolute counts.

***df\['Outcome'\].value\_counts(normalize=True)***

*Using .value\_counts() in pandas with normalization*

***Exercise 10: Using .value\_counts() in pandas with sorting***

By default, the results are sorted by count in descending order. Turn off this automatic sorting with the sort argument (True by default).

***df\['Outcome'\].value\_counts(sort=False)***

![Using .value\_counts() in pandas with sorting turned off](media/image44.png)

***Exercise 11: Using .value\_counts() in pandas while subsetting columns***

You can also apply .value\_counts() to a DataFrame object and to specific columns within it, instead of just a single column. Here, for example, we are applying value\_counts() on df with the subset argument, which takes a list of columns.
***df.value\_counts(subset=\['Pregnancies', 'Outcome'\])***

*Using .value\_counts() in pandas while subsetting columns*

**Lab (6): Understand and Visualize Your Data**

**Lab (6) Description:** This lab aims to provide basic knowledge of how to understand and visualize a machine learning dataset.

**Lab (6): Learning Outcomes**

By the end of this lab the student must be able to:

- Understand and visualize a machine learning dataset.
- Write Python code to understand and visualize a machine learning dataset.

**Lab (6): What Instructor has to do**

The instructor has to:

- Follow the steps below to demonstrate to the students how to use Python to understand and visualize machine learning data.
- Supervise students while they write and run the Python code in this lab.

**6.1 Looking at Raw Data**

It is important to look at raw data because the insight gained from it improves our chances of pre-processing and handling the data well in ML projects. The following Python script uses the ***head()*** function of a Pandas DataFrame on the **Pima Indians diabetes dataset** to look at the first 50 rows and get a better understanding of it.

Download link for the data set:

**The dataset consists of several medical predictor variables and one target variable, Outcome**. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

***Exercise 1: Look at your data***

![](media/image46.png)

**Output**

![](media/image48.png)

***We can observe from the above output that the first column gives the row number, which can be very useful for referencing a specific observation.***

**6.2 Checking Dimensions of Data**

**It is always good practice to know how much data, in terms of rows and columns, we have for our ML project**.
The reasons are:

- If we have too many rows and columns, it will take a long time to run the algorithm and train the model.
- If we have too few rows and columns, we will not have enough data to train the model well.

The following Python script prints the ***shape*** property of a Pandas DataFrame. We apply **it to the iris data set** to get the total number of rows and columns in it.

***Exercise 2: Using the shape attribute***

![](media/image50.png)

**Output**

***We can easily observe from the output that the iris data set we are going to use has 150 rows and 4 columns.***

**6.3 Getting Each Attribute's Data Type**

It is another good practice to know the data type of each attribute, because sometimes we need to convert one data type to another. For example, **we may need to convert a string into a floating point or int value to represent categorical or ordinal values.** We can get an idea of an attribute's data type by looking at the raw data, but another way is to use the ***dtypes*** property of a Pandas DataFrame, which lists the data type of each attribute. The following Python script shows how:

***Exercise 3: Print data types***

![](media/image52.png)

**Output**

***From the above output, we can easily get the data type of each attribute.***

**6.4 Statistical Summary of Data**

We have discussed how to get the shape of the data, i.e. the number of rows and columns, but often we also need to review summaries of the values themselves. **This can be done with the *describe()* function of a Pandas DataFrame, which provides the following 8 statistical properties of each data attribute:**

- Count
- Mean
- Standard Deviation
- Minimum Value
- Maximum Value
- 25%
- Median, i.e. 50%
- 75%

***Exercise 4: Look at the data statistics***

![](media/image54.png)

**Output**

![](media/image56.png)

***From the above output, we can observe the statistical summary of the data of the Pima Indian Diabetes dataset along with the shape of the data.***

**6.5 Reviewing Class Distribution**

Class distribution statistics are useful in classification problems, where we need to know the balance of class values. It is important to know the class distribution because if **we have a highly imbalanced class distribution, i.e. one class has many more observations than the other, then it may need special handling at the data preparation stage of our ML project**. We can easily get the class distribution in Python with the help of a Pandas DataFrame.

***Exercise 5: Get class distribution***

**Output**:

![](media/image58.png)

***From the above output, it can be clearly seen that the number of observations with class 0 is almost double the number of observations with class 1.***

**6.6 Reviewing Correlation between Attributes**

**The relationship between two variables is called correlation. In statistics, the most common method for calculating correlation is Pearson's Correlation Coefficient**. It ranges from -1 to 1, with three reference values:

- **Coefficient value = 1:** It represents full **positive** correlation between variables.
- **Coefficient value = -1:** It represents full **negative** correlation between variables.
- **Coefficient value = 0:** It represents **no** correlation at all between variables.

**It is always good to review the pairwise correlations of the attributes in our dataset before using it in an ML project, because some machine learning algorithms, such as linear regression and logistic regression, perform poorly if we have highly correlated attributes**. In Python, we can easily calculate a correlation matrix of dataset attributes with the help of the ***corr()*** function of a Pandas DataFrame.
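A minimal sketch of the corr() call described above, on a hypothetical toy DataFrame. The column names `x`, `double_x` and `minus_x` are made up for illustration and are not part of the Pima dataset; they are chosen so the expected coefficients are the reference values 1 and -1.

```python
import pandas as pd

# Hypothetical toy columns chosen so the coefficients are easy to predict
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "double_x": [2, 4, 6, 8, 10],   # rises exactly in step with x
    "minus_x": [5, 4, 3, 2, 1],     # falls exactly as x rises
})

# Pairwise Pearson correlation between every pair of columns
corr_matrix = toy.corr(method="pearson")

print(corr_matrix.round(2))
```

The result is a square table with one row and one column per attribute; the entry for x and double_x is 1 (full positive correlation) and the entry for x and minus_x is -1 (full negative correlation).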
***Exercise 6: Correlation Matrix***

![](media/image60.png)

**Output**

![](media/image62.png)

***The matrix in the above output gives the correlation between all pairs of attributes in the dataset.***

**6.7 Understanding Data with Visualization**

**Visualization is another way to understand the data.** With the help of data **visualization, we can see what the data looks like and what kind of correlation exists between its attributes**. It is the fastest way to see whether the features correspond to the output. The following Python scripts help us understand ML data through visualization. The following diagram shows the types of data visualization techniques.

**Univariate Plots: Understanding Attributes Independently**

The simplest type of visualization is single-variable or "univariate" visualization. With the help of univariate visualization, we can understand each attribute of our dataset independently. The following are some techniques in Python to implement univariate visualization:

**Histograms**

Histograms group the data in bins and are the fastest way to get an idea of the distribution of each attribute in a dataset. The following are some of the characteristics of histograms:

- **They give us a count of the number of observations in each bin created for visualization.**
- **From the shape of the bins, we can easily observe the distribution, i.e. whether it is Gaussian, skewed or exponential.**
- **Histograms also help us to see possible outliers.**

***Exercise 7: Plotting histograms***

The code shown below is an example of a Python script creating histograms of the attributes of the **Pima Indian Diabetes dataset.** Here, we use the ***hist()*** function of a **Pandas** DataFrame to generate the histograms and ***matplotlib*** to plot them.

![](media/image64.png)

**Output**

The above output shows that a histogram was created for each attribute in the dataset.
**From this, we can observe that the age, pedi and test attributes may have an exponential distribution, while mass and plas have a roughly Gaussian distribution**.

**Density Plots**

Another quick and **easy technique for getting each attribute's distribution is the density plot.** It is like a histogram, but with a smooth curve drawn through the top of each bin. We can think of density plots as abstracted histograms.

***Exercise 8: Plotting Density Plots***

In the following example, the Python script **generates density plots for the distribution of the attributes of the Pima Indian Diabetes dataset**.

![](media/image66.png)

**Output**

**From the above output, the difference between density plots and histograms can be easily understood**.

**Box and Whisker Plots**

Box and whisker **plots, also called boxplots for short, are another useful technique for reviewing the distribution of each attribute. The following are the characteristics of this** technique:

- It is univariate in nature and summarizes the distribution of each attribute.
- It draws a line for the middle value, i.e. the median.
- It draws a box around the 25th and 75th percentiles.
- It also draws whiskers, which give us an idea of the spread of the data.
- The dots outside the whiskers signify outlier values.

***Exercise 9: Plotting Box Plots***

In the following example, **the Python script generates box plots for the distribution of the attributes of the Pima Indian Diabetes dataset**.

![](media/image68.png)

**Output**

From the above plot of the attributes' distributions, it can be observed that age, test and skin appear skewed towards smaller values.

**Multivariate Plots: Interaction among Multiple Variables**

Another type of visualization is multi-variable or "multivariate" visualization. With its help, we can understand the interactions between multiple attributes of our dataset.
The following are some techniques in Python to implement multivariate visualization:

**Correlation Matrix Plot**

Correlation is an **indication of how two variables change together**. Earlier in this lab (section 6.6), we discussed Pearson's correlation coefficient and the importance of correlation. **We can plot the correlation matrix to show which variables have a high or low correlation with one another.**

***Exercise 10: Plotting Correlation Matrix***

In the following example, the Python script generates and plots the correlation matrix for the Pima Indian Diabetes dataset. **It can be generated with the help of the corr() function on a Pandas DataFrame and plotted with the help of *pyplot***.

![](media/image70.png)

**Output**

From the above output of the correlation matrix, **we can see that it is symmetrical, i.e. the bottom left is the same as the top right**. It can also be observed that the variables are positively correlated with one another.

**Scatter Matrix Plot**

Scatter plots use dots in two dimensions to show the relationship between two variables, i.e. how much one variable is affected by another. Like line graphs, they use horizontal and vertical axes to plot data points.

***Exercise 11: Plotting Scatter Matrix***

In the following example, the Python script generates and plots the scatter matrix for the Pima Indian Diabetes dataset. **It can be generated with the help of the *scatter\_matrix()* function of Pandas, applied to the DataFrame, and plotted with the help of *pyplot*.**

![](media/image72.png)

**Output**

**Lab (7): Data Preprocessing and Machine Learning Modeling** **Using Scikit-learn Library**

**Lab (7) Description:** This lab aims to provide basic knowledge of data preprocessing and machine learning modeling using the Scikit-learn library.

**Lab (7): Learning Outcomes**

By the end of this lab the student must be able to:

- Perform data preprocessing and machine learning modeling using the Scikit-learn library.
- Write Python code that uses Scikit-learn (Sklearn) tools for data preprocessing and modeling.

**Lab (7): What Instructor has to do?**

The instructor has to:

- Follow the steps below to demonstrate to the students how to use Python Scikit-learn tools in data preprocessing and modeling.
- Supervise students while they write and run the Scikit-learn code in this lab.

**7.1 Machine Learning**

Machine learning is a subfield of [**artificial intelligence**](https://www.datacamp.com/learn/ai) devoted to understanding and building methods that imitate the way humans learn. These methods use algorithms and data to improve performance on some set of tasks, and they often fall into one of the three most common types of learning:

- [**Supervised learning**](https://www.datacamp.com/blog/supervised-machine-learning): a type of machine learning that learns the relationship between input and output.
- [**Unsupervised learning**](https://www.datacamp.com/blog/introduction-to-unsupervised-learning): a type of machine learning that learns the underlying structure of an unlabeled dataset.
- [**Reinforcement learning**](https://www.datacamp.com/tutorial/introduction-reinforcement-learning): a method of machine learning in which a software agent learns to perform the actions in an environment that lead it to the maximum reward.

**7.2 What is Scikit-learn?**

Scikit-learn (Sklearn) is one of the most widely used and robust libraries for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction, via a consistent interface in Python. This library, which is largely written in Python, is built upon NumPy, SciPy and Matplotlib.

**7.3 What are Scikit-learn's Features?**

Rather than focusing on loading, manipulating and summarizing data, the Scikit-learn library focuses on modeling the data.
Some of the most popular groups of models provided by Sklearn are as follows:

**Supervised Learning algorithms**: Almost all the popular supervised learning algorithms, such as Linear Regression, Support Vector Machine (SVM) and Decision Tree, are part of scikit-learn.

**Unsupervised Learning algorithms**: It also has the popular unsupervised learning algorithms, from clustering, factor analysis and PCA (Principal Component Analysis) to unsupervised neural networks.

**Clustering:** This model is used for grouping unlabeled data.

**Cross Validation:** It is used to check the accuracy of supervised models on unseen data.

**Dimensionality Reduction:** It is used for reducing the number of attributes in data, which can then be used for summarization, visualization and feature selection.

**Ensemble methods:** As the name suggests, these are used for combining the predictions of multiple supervised models.

**7.4 Preprocessing the Data using Scikit-learn**

We usually deal with large amounts of data in raw form. Before feeding that data to machine learning algorithms, we need to convert it into meaningful data. This process is called preprocessing the data. **Scikit-learn has a package named preprocessing for this purpose.** The preprocessing package has the following techniques:

**1. Binarization**

This preprocessing technique is used when we need to convert our numerical values into Boolean values. As the name suggests, **this is the technique with which we can make our data binary**. We choose a binary threshold: values above the threshold are converted to 1, and values equal to or below the threshold are converted to 0.

***Exercise 1: Binarization***

*import numpy as np*
*from sklearn import preprocessing*
*\# 4 samples (rows) with 3 features (columns)*
*input\_data = np.array(\[\[2.1, -1.9, 5.5\], \[-1.5, 2.4, 3.5\], \[0.5, -7.9, 5.6\], \[5.9, 2.3, -5.8\]\])*
*data\_binarized = preprocessing.Binarizer(threshold=0.5).transform(input\_data)*
*print(\"\\nBinarized data:\\n\", data\_binarized)*

In the above example, we used a threshold value of 0.5, which is why all the values above 0.5 are converted to 1 and all the values equal to or below 0.5 are converted to 0.

**Output**

Binarized data:
\[\[ 1. 0. 1.\]
\[ 0. 1. 1.\]
\[ 0. 0. 1.\]
\[ 1. 1. 0.\]\]

**2. Standardization or Mean Removal**

**Standardization or mean removal is a technique that centers data by removing the average value of each feature, and then scales it by dividing non-constant features by their standard deviation**. It\'s usually beneficial to remove the mean from each feature so that it\'s centered on zero. This helps us remove bias from features. The formula used to achieve this is the following:

![https://static.packt-cdn.com/products/9781789808452/graphics/assets/b5b371c1-e3aa-4d68-ba69-1d1b6f8276db.png](media/image74.png)

After standardization, each feature has mean = 0 and sd = 1, the same mean and standard deviation as a standard normal distribution. In the formula, mean is the feature's mean and sd is its standard deviation. The following exercise shows how to do it.

***Exercise 2: Standardization or Mean Removal***

*import numpy as np*
*from sklearn import preprocessing*
*input\_data = np.array(\[\[2.1, -1.9, 5.5\], \[-1.5, 2.4, 3.5\], \[0.5, -7.9, 5.6\], \[5.9, 2.3, -5.8\]\])*
*\# Display the mean and the standard deviation of each column of the input data*
*print(\"Mean =\", input\_data.mean(axis=0))*
*print(\"Stddeviation = \", input\_data.std(axis=0))*
*\# Remove the mean and scale each column to unit standard deviation*
*data\_scaled = preprocessing.scale(input\_data)*
*print(\"Mean\_removed =\", data\_scaled.mean(axis=0))*
*print(\"Stddeviation\_removed =\", data\_scaled.std(axis=0))*

**Output**

Mean = \[ 1.75 -1.275 2.2 \]
Stddeviation = \[ 2.71431391 4.20022321 4.69414529\]
Mean\_removed = \[ 1.11022302e-16 0.00000000e+00 0.00000000e+00\]
Stddeviation\_removed = \[ 1. 1. 1.\]

You can see that the mean is now almost 0 and the standard deviation is 1.

**3. Scaling**

Most datasets contain attributes with varying scales, and many ML algorithms work poorly on such data, so it needs rescaling. Data rescaling makes sure that the attributes are on the same scale. Generally, attributes are rescaled into the range from 0 to 1. Algorithms that rely on gradient descent or on distances, such as k-Nearest Neighbors, benefit from scaled data. We can rescale the data with the help of the MinMaxScaler class of the scikit-learn Python library. We use this preprocessing technique for scaling the feature vectors, which is important because no feature should be artificially large or small.

***Exercise 3: Scaling***

*import numpy as np*
*from sklearn import preprocessing*
*input\_data = np.array(\[\[2.1, -1.9, 5.5\], \[-1.5, 2.4, 3.5\], \[0.5, -7.9, 5.6\], \[5.9, 2.3, -5.8\]\])*
*\# Rescale every column into the range \[0, 1\]*
*data\_scaler\_minmax = preprocessing.MinMaxScaler(feature\_range=(0,1))*
*data\_scaled\_minmax = data\_scaler\_minmax.fit\_transform(input\_data)*
*print(\"\\nMin max scaled data:\\n\", data\_scaled\_minmax)*

**Output**

Min max scaled data:
\[\[ 0.48648649 0.58252427 0.99122807\]
\[ 0. 1. 0.81578947\]
\[ 0.27027027 0. 1. \]
\[ 1. 0.99029126 0. \]\]

**4. Normalization**

We use this preprocessing technique for modifying the feature vectors. Normalization of feature vectors is necessary so that the feature vectors can be measured on a common scale. There are two types of normalization, as follows:

**L1 Normalization**

It is also called Least Absolute Deviations. It modifies the values in such a manner that the sum of the absolute values is always 1 in each row.
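The arithmetic behind L1 normalization can be checked by hand on a single row. The following is a minimal sketch using only NumPy, assuming the first row of the input data used in this section's exercises; it is meant to illustrate the rule, not to show how scikit-learn implements it internally.

```python
import numpy as np

# First row of the input data used in the exercises in this section.
row = np.array([2.1, -1.9, 5.5])

# L1 normalization: divide each value by the sum of absolute values in the row.
l1_norm = np.abs(row).sum()      # 2.1 + 1.9 + 5.5 = 9.5
row_l1 = row / l1_norm

print(row_l1)                    # each value divided by 9.5
print(np.abs(row_l1).sum())      # sum of absolute values of the normalized row
```

The signs of the values are preserved; only their magnitudes are rescaled so that the absolute values in the row add up to 1.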
The following example shows the implementation of L1 normalization on the input data.

***Exercise 4: Normalization***

*import numpy as np*
*from sklearn import preprocessing*
*input\_data = np.array(\[\[2.1, -1.9, 5.5\], \[-1.5, 2.4, 3.5\], \[0.5, -7.9, 5.6\], \[5.9, 2.3, -5.8\]\])*
*\# Scale each row so that its absolute values sum to 1*
*data\_normalized\_l1 = preprocessing.normalize(input\_data, norm=\'l1\')*
*print(\"\\nL1 normalized data:\\n\", data\_normalized\_l1)*

**Output**

L1 normalized data:
\[\[ 0.22105263 -0.2 0.57894737\]
\[-0.2027027 0.32432432 0.47297297\]
\[ 0.03571429 -0.56428571 0.4 \]
\[ 0.42142857 0.16428571 -0.41428571\]\]

**L2 Normalization**

Also called Least Squares. It modifies the values in such a manner that the sum of the squares is always 1 in each row. The following example shows the implementation of L2 normalization on the input data.

***Exercise 5: L2 Normalization***

*import numpy as np*
*from sklearn import preprocessing*
*input\_data = np.array(\[\[2.1, -1.9, 5.5\], \[-1.5, 2.4, 3.5\], \[0.5, -7.9, 5.6\], \[5.9, 2.3, -5.8\]\])*
*\# Scale each row so that its squared values sum to 1*
*data\_normalized\_l2 = preprocessing.normalize(input\_data, norm=\'l2\')*
*print(\"\\nL2 normalized data:\\n\", data\_normalized\_l2)*

**Output**

L2 normalized data:
\[\[ 0.33946114 -0.30713151 0.88906489\]
\[-0.33325106 0.53320169 0.7775858 \]
\[ 0.05156558 -0.81473612 0.57753446\]
\[ 0.68706914 0.26784051 -0.6754239 \]\]

**7.5 Machine Learning Modelling Process Using Sklearn**

This section deals with the modelling process in Sklearn, which has several stages.

**Stage 1. Dataset Loading**

A collection of data is called a dataset. It has the following two components:

**Features:** The variables of the data are called its features. They are also known as predictors, inputs or attributes.

**Feature matrix:** It is the collection of features, in case there is more than one.

**Feature Names:** It is the list of all the names of the features.
**Response:** It is the output variable, which depends upon the feature variables. It is also known as the target, label or output.

**Response Vector:** It is used to represent the response column. Generally, we have just one response column.

**Target Names:** They represent the possible values taken by the response vector.

Scikit-learn ships a few example datasets, such as **iris** and **digits** for classification and **diabetes** for regression (the **Boston house prices** dataset shipped with older releases but was removed in scikit-learn 1.2).

***Exercise 6: Loading iris dataset***

Following is an example to load **iris** dataset: download iris from:

*from sklearn.datasets import load\_iris*
*iris = load\_iris()*
*X = iris.data \# feature matrix*
*y = iris.target \# response vector*
*feature\_names = iris.feature\_names*
*target\_names = iris.target\_names*
*print(\"Feature names:\", feature\_names)*
*print(\"Target names:\", target\_names)*
*print(\"\\nFirst 10 rows of X:\\n\", X\[:10\])*

**Output**

Feature names: \[\'sepal length (cm)\', \'sepal width (cm)\', \'petal length (cm)\', \'petal width (cm)\'\]
Target names: \[\'setosa\' \'versicolor\' \'virginica\'\]
First 10 rows of X:
\[\[5.1 3.5 1.4 0.2\]
\[4.9 3. 1.4 0.2\]
\[4.7 3.2 1.3 0.2\]
\[4.6 3.1 1.5 0.2\]
\[5. 3.6 1.4 0.2\]
\[5.4 3.9 1.7 0.4\]
\[4.6 3.4 1.4 0.3\]
\[5. 3.4 1.5 0.2\]
\[4.4 2.9 1.4 0.2\]
\[4.9 3.1 1.5 0.1\]\]

**Stage 2: Splitting the dataset**

To check the accuracy of our model, we **can split the dataset into two pieces: a training set and a testing set.** We use the training set to train the model and the testing set to test it. After that, we can evaluate how well our model did.

***Exercise 7: Splitting data***

The following example splits the data in a 70:30 ratio, i.e. 70% of the data is used as training data and 30% as testing data. The dataset is the iris dataset from the example above.

*from sklearn.datasets import load\_iris*
*from sklearn.model\_selection import train\_test\_split*
*iris = load\_iris()*
*X = iris.data*
*y = iris.target*
*\# Hold out 30% of the rows for testing; random\_state fixes the shuffle*
*X\_train, X\_test, y\_train, y\_test = train\_test\_split(X, y, test\_size=0.3, random\_state=1)*
*print(X\_train.shape)*
*print(X\_test.shape)*
*print(y\_train.shape)*
*print(y\_test.shape)*

**Output**

(105, 4)
(45, 4)
(105,)
(45,)

As seen in the example above, it uses the ***train\_test\_split()*** function of scikit-learn to split the dataset. This function has the following arguments:

- **X, y**: Here, X is the feature matrix and y is the response vector, which need to be split.
- **test\_size**: This represents the ratio of test data to the total given data. In the above example, we set test\_size = 0.3 for the 150 rows of X, which produces test data of 150\*0.3 = 45 rows.
- **random\_state**: It is used to guarantee that the split will always be the same. This is useful in situations where you want reproducible results.

**Stage 3: Train the Model**

**Stage 4: Build the Model**

**Stage 5: Evaluate the Model**

In the example below, we use the KNN (K nearest neighbors) classifier as our machine learning algorithm. The classifier is trained on the training set and then used to predict the testing set. Finally, the classifier is evaluated by computing and printing its accuracy in predicting the test set.

***Exercise 8: Train, build and evaluate the Model***

*from sklearn.datasets import load\_iris*
*from sklearn.model\_selection import train\_test\_split*
*from sklearn.neighbors import KNeighborsClassifier*
*from sklearn import metrics*
*iris = load\_iris()*
*X = iris.data*
*y = iris.target*
*X\_train, X\_test, y\_train, y\_test = train\_test\_split(X, y, test\_size=0.4, random\_state=1)*
*classifier\_knn = KNeighborsClassifier(n\_neighbors=3)*
*classifier\_knn.fit(X\_train, y\_train)*
*y\_pred = classifier\_knn.predict(X\_test)*
*\# Find the accuracy by comparing actual response values (y\_test) with predicted response values (y\_pred)*
*print(\"Accuracy:\", metrics.accuracy\_score(y\_test, y\_pred))*
*\# Provide sample data and let the model make predictions for it*
*sample = \[\[5, 5, 3, 2\], \[2, 4, 3, 5\]\]*
*preds = classifier\_knn.predict(sample)*
*pred\_species = \[iris.target\_names\[p\] for p in preds\]*
*print(\"Predictions:\", pred\_species)*

**Output**

Accuracy: 0.9833333333333333
Predictions: \[\'versicolor\', \'virginica\'\]

**Stage 6: Model Persistence**

Once the model is trained, it is desirable to persist it for future use so that we do not need to retrain it again and again. This can be done with the help of the dump and load features of the ***joblib package***. Consider the example below, in which we save the above trained model (classifier\_knn) for future use:

***Exercise 9: Save the Trained Model***

***import joblib***
***joblib.dump(classifier\_knn, \'iris\_classifier\_knn.joblib\')***

Older tutorials import joblib with *from sklearn.externals import joblib*; that module has been removed from recent scikit-learn releases, so joblib is imported directly. The above code saves the model into a file named iris\_classifier\_knn.joblib. The object can then be reloaded from the file with the help of the following code:

***classifier\_knn = joblib.load(\'iris\_classifier\_knn.joblib\')***

**Lab (8): Dataset Feature Selection Techniques**

**Lab (8) Description:** This lab aims to provide basic knowledge of how to apply feature selection techniques to datasets.

**Lab (8): Learning Outcomes**

By the end of this lab the student must be able to:

- Understand how feature selection techniques apply to datasets.
- Write Python code that applies feature selection techniques to datasets.

**Lab (8): What Instructor has to do**

The instructor has to:

- Follow the steps below to demonstrate to the students how to apply feature selection techniques to datasets.
- Supervise students while they write and run the Python code for applying feature selection techniques to datasets.
**8.1 Importance of Data Feature Selection**

**The performance of a machine learning model depends directly on the data features used to train it**. **The performance of an ML model is affected negatively if the data features provided to it are irrelevant**. On the other hand, the use of relevant data features can increase the accuracy of your ML model, especially for linear and logistic regression.

**What, then, is feature selection**? It may be defined as the process by which we select those features in our data that are most relevant to the output or prediction variable in which we are interested. It is also called attribute selection.

**The following are some of the benefits of feature selection before modeling the data:**

- Performing feature selection before data modeling **can reduce overfitting**.
- Performing feature selection before data modeling **can increase the accuracy of the ML model.**
- Performing feature selection before data modeling **reduces the training time.**

**8.2 Feature Selection Techniques**

The following are automatic feature selection techniques that we can use to model ML data in Python:

**1. Univariate Selection**

This feature selection technique uses statistical tests to select those features that have the strongest relationship with the prediction variable. **We can implement the univariate feature selection technique with the help of the SelectKBest class of the scikit-learn Python library.**

**Example**: In this example, we will use the Pima Indians Diabetes dataset **to select the 4 best attributes with the help of the chi-square statistical test.**

***Exercise 1: Selection of Best Features using chi-square***

![](media/image76.png)

**Next, we will separate the array into input and output components:**

**The following lines of code will select the best features from the dataset:**

![](media/image78.png)

**We can also summarize the data for output as per our choice. Here, we are setting the precision to 2 and showing the 4 best data attributes along with the score of each attribute**:

**Output**

![](media/image80.png)

**2. Recursive Feature Elimination**

As the name suggests, **the RFE (Recursive Feature Elimination) feature selection technique removes attributes recursively and builds the model with the remaining attributes**. We can implement the RFE feature selection technique with the help of the **RFE** class of the **scikit-learn** Python library.

***Exercise 2: Selection of Best Features using RFE***

**In this example, we will use RFE with the logistic regression algorithm to select the best 3 attributes from the Pima Indians Diabetes dataset.**

**Next, we will separate the array into its input and output components:**

![](media/image82.png)

**The following lines of code will select the best features from a dataset:**

**Output**

![](media/image84.png)

**We can see in the above output that RFE chose preg, mass and pedi as the best 3 features. They are marked as 1 in the output.**

**3. Principal Component Analysis (PCA)**

**PCA, generally called a data reduction technique, is a very useful feature selection technique, as it uses linear algebra to transform the dataset into a compressed form**. Strictly speaking, it builds new features (the principal components) from the original ones rather than picking a subset of them. We can implement the PCA **feature selection technique with the help of the PCA class of the scikit-learn Python library. We can select the number of principal components in the** output.

***Exercise 3: Applying PCA to select Best Features***

In this example, we will use PCA to select the best 3 principal components from the Pima Indians Diabetes dataset.

**Next, we will separate the array into input and output components:**

![](media/image86.png)

**The following lines of code will extract features from the dataset:**

**Output**

![](media/image88.png)

**We can observe from the above output that the 3 principal components bear little resemblance to the source data.**

**4. Feature Importance**

As the name suggests, **the feature importance technique is used to choose the important features. It uses a trained supervised classifier to select features**. We can implement this feature selection **technique with the help of the ExtraTreesClassifier class of the scikit-learn Python library.**

***Exercise 4: Selection of Best Features Using the* ExtraTreesClassifier Class**

**In this example, we will use ExtraTreesClassifier to select features from the Pima Indians Diabetes dataset**.

**Next, we will separate the array into input and output components:**

![](media/image90.png)

**The following lines of code will extract features from the dataset:**

**Output**

![](media/image92.png)

**From the output, we can observe that there is a score for each attribute. The higher the score, the higher the importance of that attribute.**

**8.3 Supervised Machine Learning**

Supervised machine learning is a subset of machine learning and artificial intelligence. **It is distinguished by the way it trains computers to accurately classify data or predict outcomes using labeled datasets**. As input data is fed into it, the model adjusts its weights until it has been properly fitted, which takes place as part of the cross validation process. **Supervised learning helps us solve a number of real-world problems, such as moving spam into a separate folder from the rest of your email.**

Two groups of algorithms are used in supervised learning:

**Classification:** When the output variable is a category, such as "red" or "blue," "illness" or "no disease," a classification problem exists.

**Regression:** When the output variable has a real or continuous value, such as "dollars," "weight," or "wind speed," a regression problem exists.

**1. Why We Use Supervised Machine Learning Algorithms?**

**We use supervised machine learning algorithms when we have to train models on labeled datasets**.
Supervised learning is used when we wish to map inputs to output labels **(classification)** or to a continuous output **(regression)**. ***Logistic regression, naive Bayes, support vector machines, artificial neural networks, and random forests are typical supervised learning techniques***. The aim of both classification and regression is to find precise correlations or structures in the input data that enable us to produce accurate output data efficiently.

**8.4 Classification**

**Classification may be defined as the process of predicting a class or category from observed values or given data points.** The categorized output can take forms such as "Black" or "White", or "spam" or "no spam". Mathematically, classification is the task of approximating a mapping function (f) from input variables (X) to output variables (Y). It belongs to supervised machine learning, in which the targets are provided along with the input dataset.

An example of a classification problem is spam detection in emails. There can be only two categories of output, "spam" and "no spam"; hence this is a binary classification.

**1. Types of Learners in Classification**

We have two types of learners with respect to classification problems:

**Lazy Learners**

As the name suggests, **lazy learners store the training data and wait for the testing data to appear**. Classification is done only after the testing data arrives. They spend less time on training but more time on predicting. An example of a lazy learner is the K-Nearest Neighbor algorithm.

**Eager Learners**

In contrast to lazy learners, **eager learners construct a classification model from the training data without waiting for the testing data to appear**. They spend more time on training but less time on predicting. Examples of eager learners are Decision Trees, Naïve Bayes and Artificial Neural Networks (ANN).
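The difference between the two kinds of learners can be illustrated with a toy sketch in plain Python. The classes `LazyNearestNeighbor` and `EagerCentroidClassifier` below are hypothetical illustrations written for this comparison, not scikit-learn classes, and the four training points are made up.

```python
from collections import defaultdict


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class LazyNearestNeighbor:
    """Lazy learner: fit() only stores the data; the work happens in predict()."""

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)   # no model is built here
        return self

    def predict(self, point):
        # Scan every stored training point at prediction time.
        distances = [squared_distance(row, point) for row in self.X]
        return self.y[distances.index(min(distances))]


class EagerCentroidClassifier:
    """Eager learner: fit() builds a compact model (one centroid per class)."""

    def fit(self, X, y):
        groups = defaultdict(list)
        for row, label in zip(X, y):
            groups[label].append(row)
        # Once the centroids are known, the training data is no longer needed.
        self.centroids = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()
        }
        return self

    def predict(self, point):
        return min(self.centroids,
                   key=lambda label: squared_distance(self.centroids[label], point))


# Made-up training data: two features per sample.
X_train = [[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0]]
y_train = ["spam", "spam", "no spam", "no spam"]

lazy = LazyNearestNeighbor().fit(X_train, y_train)
eager = EagerCentroidClassifier().fit(X_train, y_train)
print(lazy.predict([1.1, 0.9]), "|", eager.predict([3.9, 4.1]))
```

The lazy learner's `fit()` is almost free but every prediction touches all stored samples, while the eager learner pays once during `fit()` and then compares each new point against only one centroid per class.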
**2. Types of Classification Algorithms**

When a discrete outcome can have two distinct values, such as True or False, Default or No Default, or Yes or No, a **classification method is used to predict that outcome**. **This type of approach is known as Binary Classification**. **Multiclass classification refers to the process of determining an outcome from more than two alternative values**.

Machine learning classification can be performed using a variety of algorithms. The popular ones are:

- Logistic Regression
- Naive Bayes Classifier
- Decision Tree Classifier
- K Nearest Neighbor Classifier
- Random Forest Classifier

**8.5 The Process of Building a Classification Model with Python**

The following example shows the steps required to build a machine learning classification model with Python.

***Exercise 5: The Process of Building a Classification Model with Python***

**Scikit-learn**, a Python library for machine learning, can be used to build a classifier in Python. **The steps for building a classifier in Python are as follows**:

**Step 1: Importing the necessary Python package**

To build a classifier using scikit-learn, we need to import it. **We can import it by using the following script:**

**Step 2: Importing the dataset**

After importing the necessary package, **we need a dataset to build the classification prediction model.** We can import it from the sklearn datasets or use another one as per our requirement. **We are going to use sklearn's Breast Cancer Wisconsin Diagnostic Database**. We can import it with the help of the following script:

![](media/image94.png)

**The following script will load the dataset:**

We also need to organize the data, which can be done with the help of the following scripts:

![](media/image96.png)

The following command will print the names of the labels, **'malignant'** and **'benign'** in the case of our dataset.
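The scripts for these steps are shown only as images. A minimal sketch of what importing, loading and organizing the data could look like is given below; the variable names `label_names`, `labels`, `feature_names` and `features` are assumptions chosen for illustration and may differ from the names used in the lab's own scripts.

```python
from sklearn.datasets import load_breast_cancer

# Load the Breast Cancer Wisconsin Diagnostic dataset bundled with scikit-learn.
data = load_breast_cancer()

# Organize the data (assumed variable names, see the note above).
label_names = data.target_names      # class names
labels = data.target                 # 0/1 class value for every sample
feature_names = data.feature_names   # names of the measured attributes
features = data.data                 # feature matrix, one row per sample

print(label_names)       # names of the two classes
print(features.shape)    # (number of samples, number of features)
```

Keeping the labels and the features in separate variables mirrors the feature matrix and response vector described in Lab 7.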
The output of the above command is the names of the labels:

![](media/image98.png)

**These labels are mapped to the binary values 0 and 1**. **Malignant** cancer is represented by 0 and **Benign** cancer is represented by 1.

**The feature names and feature values of these labels can be seen with the help of the following commands**:

The output of the above command **is the names of the features for label 0**, i.e. **Malignant** cancer:

![](media/image100.png)

Similarly, the **names of the features for label 1 can be** produced as follows:

The output of the above command **is the names of the features for label 1**, i.e. **Benign** cancer:

![](media/image102.png)

We can **print the values of the features for label 0** with the help of the following command:

**This will give the following output:**

![](media/image104.png)

We can print the values of
