Python Modules for Data Science PDF
Document Details
Uploaded by RationalAntigorite660
Summary
This document provides an introduction to Python modules, focusing on the NumPy module, which is a powerful library for working with numerical arrays. It explains how arrays are created, indexed, and sliced, and demonstrates different operations with examples. NumPy arrays are faster and more memory-efficient than Python lists, enabling efficient computations in data science.
Full Transcript
Week 7: Python Modules for Data Science

Python Modules – Introduction

Modules are used to organize Python code into smaller parts. A module is simply a Python file in which statements, classes, objects, functions, constants and variables are defined. Grouping similar code into a single file makes it easy to access. There are many Python modules, each with its own specific purpose. This section covers how to create a simple module, how to import Python modules, how to use the from statement, and how to rename a module with an alias.

A Python module is a file containing Python definitions and statements. A module can define functions, classes, and variables, and it can also include runnable code. Grouping related code into a module makes the code easier to understand and use, and keeps it logically organized.

Import module in Python

We can import the functions and classes defined in a module into another Python source file using the import statement. When the interpreter encounters an import statement, it imports the module if the module is present in the search path.

Python Import From Module

Python's from statement lets you import specific attributes from a module without importing the module as a whole.

Example:

    # importing sqrt() and factorial() from the module math
    from math import sqrt, factorial

    # if we simply do "import math", then
    # math.sqrt(16) and math.factorial(6) are required.
    print(sqrt(16))
    print(factorial(6))

Locating Python Modules

Whenever a module is imported, the interpreter looks in several locations. First, it checks for a built-in module; if none is found, it looks through the list of directories defined in sys.path. The Python interpreter searches for the module in the following manner:

First, it searches for the module in the current directory.
If the module isn't found in the current directory, Python then searches each directory listed in PYTHONPATH, an environment variable consisting of a list of directories. If that also fails, Python checks the installation-dependent list of directories configured at the time Python was installed.

NumPy Module

NumPy is a Python library used for working with arrays. Python lists can serve the purpose of arrays, but they are slow to process. NumPy aims to provide an array object that is up to 50x faster than traditional Python lists. The array object in NumPy is called ndarray, and it comes with many supporting functions that make working with it easy. Arrays are used very frequently in data science, where speed and resources are very important.

Unlike lists, NumPy arrays are stored in one contiguous block of memory, so processes can access and manipulate them very efficiently. This behaviour is called locality of reference in computer science, and it is the main reason why NumPy is faster than lists. NumPy is also optimized to work with the latest CPU architectures.

NumPy Module - Array

We can create a NumPy ndarray object by using the array() function. NumPy can be used to perform a wide variety of mathematical operations on arrays. It adds powerful data structures to Python that support efficient calculations with arrays and matrices, and it supplies an enormous library of high-level mathematical functions that operate on these arrays and matrices.

Python lists are a substitute for arrays, but they fail to deliver the performance required when computing over large sets of numerical data. To address this issue we use a Python library called NumPy. The name NumPy stands for Numerical Python. NumPy offers an array object called ndarray. These arrays are similar to standard Python sequences but differ in certain key respects.
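As a rough illustration of the points above, the following sketch builds an ndarray from a list with array() and inspects how it is stored. The variable names are illustrative only, and the snippet checks layout and results rather than measuring speed.

```python
import numpy as np

values = list(range(1000))
arr = np.array(values)          # build an ndarray from a Python list

# The result is an ndarray stored as one contiguous block of memory.
print(type(arr))
print(arr.flags["C_CONTIGUOUS"])

# The buffer size is simply (number of elements) * (bytes per element).
print(arr.nbytes == arr.size * arr.itemsize)

# A vectorized operation gives the same values as the list comprehension.
doubled_arr = arr * 2
doubled_list = [v * 2 for v in values]
print(doubled_arr.tolist() == doubled_list)
```

The vectorized form `arr * 2` runs the loop inside NumPy's compiled code, which is where the speed advantage over a Python-level loop comes from.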
NumPy Arrays vs Inbuilt Python Sequences

Unlike lists, NumPy arrays have a fixed size; changing the size of an array creates a new array and deletes the original. All the elements in an array are of the same type. NumPy arrays are faster, more efficient, and require less syntax than standard Python sequences.

NumPy arrays are fast for the following reasons. NumPy is written mostly in C, and its arrays are stored in contiguous memory locations, which makes them easy to access and manipulate. This means that you can get close to the performance of C code with the ease of writing a Python program.

1. Homogeneous Data: NumPy arrays store elements of the same data type, making them more compact and memory-efficient than lists.
2. Fixed Data Type: NumPy arrays have a fixed data type, reducing memory overhead by eliminating the need to store type information for each element.
3. Contiguous Memory: NumPy arrays store elements in adjacent memory locations, reducing fragmentation and allowing for efficient access.

Data Allocation in NumPy Array

In NumPy, data is allocated contiguously in memory, following a well-defined layout consisting of the data buffer, shape, and strides. This is essential for efficient data access, vectorized operations, and compatibility with low-level libraries like BLAS and LAPACK.

1. Data Buffer: The data buffer in NumPy is a single, flat block of memory that stores the actual elements of the array, regardless of its dimensionality. This enables efficient element-wise operations and data access.
2. Shape: The shape of an array is a tuple of integers that gives the size of the array along each axis. It is essential for correctly indexing and reshaping the array.
3. Strides: Strides are a tuple of integers that give the number of bytes to step in each dimension when moving from one element to the next. They determine the spacing between elements in memory.

Create NumPy Array from a List

You can use the np alias to create an ndarray from a list with the array() method.

    li = [1,2,3,4]
    numpyArr = np.array(li)

Example:

    import numpy as np

    li = [1, 2, 3, 4]
    numpyArr = np.array(li)
    print(numpyArr)

Example:

    import numpy as np

    li = [1, 2, 3, 4]
    numpyArr = np.array(li)
    print("li =", li, "and type(li) =", type(li))
    print("numpyArr =", numpyArr, "and type(numpyArr) =", type(numpyArr))

NumPy Module - Array Indexing and Slicing

NumPy indexing is used for accessing an element of an array by giving its index, which starts from 0. Slicing a NumPy array means extracting the elements in a specific range. It works much like taking a substring, subtuple, or sublist from a string, tuple, or list.

Indexing an Array

Indexing is used to access individual elements. It is also possible to extract entire rows, columns, or planes from multi-dimensional arrays with NumPy indexing. Indexing starts from 0. When arrays are used as indexes to access groups of elements, this is called indexing using index arrays. NumPy arrays can be indexed with arrays or with any other sequence, such as a list.

Example:

    import numpy as np

    arr = np.arange(1, 10, 2)
    print("Elements of array: ", arr)
    arr1 = arr[np.array([4, 0, 2, -1, -2])]
    print("Indexed Elements of array arr: ", arr1)

Indexing can be done in NumPy by using an array as an index. A slice returns a view (a shallow copy) of the array, whereas an index array returns a copy of the selected elements. NumPy arrays can be indexed with other arrays or with any other sequence, with the exception of tuples, which are interpreted as a multi-dimensional index instead. The last element is indexed by -1, the second last by -2, and so on.
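The difference between a view and a copy can be checked directly. The sketch below is a minimal illustration with made-up variable names: it modifies a slice and shows that the original array changes, while the result of an index array is unaffected.

```python
import numpy as np

arr = np.arange(1, 10, 2)            # elements 1, 3, 5, 7, 9

sliced = arr[1:4]                    # basic slice: a view onto arr
picked = arr[np.array([1, 2, 3])]    # index array: an independent copy

# Writing through the view also changes the original array.
sliced[0] = 100
print(arr.tolist())      # [1, 100, 5, 7, 9]
print(picked.tolist())   # [3, 5, 7]

# np.shares_memory reports whether two arrays use the same buffer.
print(np.shares_memory(arr, sliced))   # True
print(np.shares_memory(arr, picked))   # False
```

When an independent array is needed from a slice, calling `.copy()` on the slice avoids accidental changes to the original.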
Example:

    # Python program to demonstrate the use of index arrays.
    import numpy as np

    # Create a sequence of integers from 10 down to 2 with a step of -2
    a = np.arange(10, 1, -2)
    print("\n A sequential array with a negative step: \n", a)

    # Indexes are specified inside the np.array method.
    newarr = a[np.array([3, 1, 2])]
    print("\n Elements at these indices are:\n", newarr)

Example:

    import numpy as np

    # NumPy array with elements from 1 to 9
    x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

    # Index values can be negative.
    arr = x[np.array([1, 3, -3])]
    print("\n Elements are : \n", arr)

Types of Indexing

There are two types of indexing:

1. Basic slicing and indexing: Consider the syntax x[obj], where x is the array and obj is the index. In basic slicing the index is a slice object. Basic slicing occurs when obj is:
   1. a slice object of the form start : stop : step
   2. an integer
   3. a tuple of slice objects and integers
2. Advanced indexing: Advanced indexing is triggered when obj is:
   1. an ndarray of integer or Boolean type
   2. a tuple with at least one sequence object
   3. a non-tuple sequence object

Example:

    # Python program for basic slicing.
    import numpy as np

    # Arrange elements from 0 to 19
    a = np.arange(20)
    print("\n Array is:\n ", a)

    # a[start:stop:step]
    print("\n a[-8:17:1] = ", a[-8:17:1])

    # With no stop value, the slice runs to the end of the array.
    print("\n a[10:] = ", a[10:])

Example:

    # Python program for indexing using basic slicing with ellipsis
    import numpy as np

    # A 3 dimensional array.
    b = np.array([[[1, 2, 3], [4, 5, 6]],
                  [[7, 8, 9], [10, 11, 12]]])
    print(b[..., 1])  # Equivalent to b[:, :, 1]

Example:

    # Python program showing advanced indexing
    import numpy as np

    a = np.array([[1, 2], [3, 4], [5, 6]])
    print(a[[0, 1, 2], [0, 0, 1]])

Example:

    import numpy as np

    a = np.arange(10)
    print("The ndarray is :")
    print(a)
    s = slice(2, 7, 2)
    print("After applying slice() Function:")
    print(a[s])

Example:

    import numpy as np

    a = np.arange(15)
    print("The array is :")
    print(a)

    # using the index directly; index 7 is the eighth item because indexing starts at 0
    b = a[7]
    print("The Eighth item in the array is :")
    print(b)

NumPy Basic Array Operations

ndim – It returns the number of dimensions of the array.
itemsize – It gives the size in bytes of each element.
dtype – It gives the data type of the elements.
reshape – It gives the array a new shape, returning a view where possible.
slicing – It extracts a particular set of elements.
linspace – It returns evenly spaced elements.

Example:

    # Python code to perform arithmetic operations on NumPy arrays
    import numpy as np

    # Initializing the arrays (np.float64 replaces np.float_, which was removed in NumPy 2.0)
    arr1 = np.arange(4, dtype=np.float64).reshape(2, 2)
    print('First array:')
    print(arr1)

    print('\nSecond array:')
    arr2 = np.array([12, 12])
    print(arr2)

    print('\nAdding the two arrays:')
    print(np.add(arr1, arr2))

    print('\nSubtracting the two arrays:')
    print(np.subtract(arr1, arr2))

    print('\nMultiplying the two arrays:')
    print(np.multiply(arr1, arr2))

    print('\nDividing the two arrays:')
    print(np.divide(arr1, arr2))

Example:

    import numpy as np

    first_array = np.array([1, 3, 5, 7])
    second_array = np.array([2, 4, 6, 8])

    # using the + operator
    result1 = first_array + second_array
    print("Using the + operator:", result1)

    # using the add() function
    result2 = np.add(first_array, second_array)
    print("Using the add() function:", result2)

Example:

    import numpy as np

    first_array = np.array([3, 9, 27, 81])
    second_array = np.array([2, 4, 8, 16])

    # using the - operator
    result1 = first_array - second_array
    print("Using the - operator:", result1)

    # using the subtract() function
    result2 = np.subtract(first_array, second_array)
    print("Using the subtract() function:", result2)

Example:

    # Python code to perform the reciprocal operation on NumPy arrays
    import numpy as np

    arr = np.array([25, 1.33, 1, 1, 100])
    print('Our array is:')
    print(arr)
    print('\nAfter applying reciprocal function:')
    print(np.reciprocal(arr))

    # The list literal was lost from the source; [25] is an assumed placeholder value.
    arr2 = np.array([25], dtype=int)
    print('\nThe second array is:')
    print(arr2)
    # For integer arrays the reciprocal is truncated to an integer.
    print('\nAfter applying reciprocal function:')
    print(np.reciprocal(arr2))

Example:

    # Python code to perform the power operation on NumPy arrays
    import numpy as np

    arr = np.array([5, 10, 15])
    print('First array is:')
    print(arr)
    print('\nApplying power function:')
    print(np.power(arr, 2))

    print('\nSecond array is:')
    arr1 = np.array([1, 2, 3])
    print(arr1)
    print('\nApplying power function again:')
    print(np.power(arr, arr1))

Example:

    # Python code to perform the mod function on NumPy arrays
    import numpy as np

    arr = np.array([5, 15, 20])
    arr1 = np.array([2, 5, 9])
    print('First array:')
    print(arr)
    print('\nSecond array:')
    print(arr1)

    print('\nApplying mod() function:')
    print(np.mod(arr, arr1))

    print('\nApplying remainder() function:')
    print(np.remainder(arr, arr1))

Example:

    # importing python module named numpy
    import numpy as np

    # making a 3x3 array
    gfg = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

    # before transpose
    print(gfg, end='\n\n')

    # after transpose
    print(gfg.transpose())

Example:

    # importing python module named numpy
    import numpy as np

    # making a 3x2 array
    gfg = np.array([[1, 2], [4, 5], [7, 8]])

    # before transpose
    print(gfg, end='\n\n')

    # after transpose (axes given explicitly)
    print(gfg.transpose(1, 0))

Example:

    # importing python module named numpy
    import numpy as np

    # making a 3x3 array
    gfg = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

    # before transpose
    print(gfg, end='\n\n')

    # after transpose
    print(gfg.T)

Example:

    import numpy as np

    first_array = np.array([1, 3, 5, 7])
    second_array = np.array([2, 4, 6, 8])

    # using the * operator
    result1 = first_array * second_array
    print("Using the * operator:", result1)

    # using the multiply() function
    result2 = np.multiply(first_array, second_array)
    print("Using the multiply() function:", result2)

Example:

    import numpy as np

    first_array = np.array([1, 2, 3])
    second_array = np.array([4, 5, 6])

    # using the / operator
    result1 = first_array / second_array
    print("Using the / operator:", result1)

    # using the divide() function
    result2 = np.divide(first_array, second_array)
    print("Using the divide() function:", result2)

Example:

    import numpy as np

    array1 = np.array([1, 2, 3])

    # using the ** operator
    result1 = array1 ** 2
    print("Using the ** operator:", result1)

    # using the power() function
    result2 = np.power(array1, 2)
    print("Using the power() function:", result2)

Example:

    import numpy as np

    arr1 = np.array([1, 2, 3, 4, 5, 6, 7])
    print("The original array is:", arr1)
    number = 10
    # the scalar is added to every element
    arr = arr1 + number
    print("The modified array is:", arr)

Example:

    import numpy as np

    arr1 = np.array([1, 2, 3, 4, 5, 6, 7])
    print("The first array is:", arr1)
    arr2 = np.array([8, 9, 10, 11, 12, 13, 14])
    print("The second array is:", arr2)
    arr = arr1 + arr2
    print("The output array is:", arr)

Example:

    import numpy as np

    arr1 = np.array([1, 2, 3, 4, 5, 6, 7])
    print("The first array is:", arr1)
    arr2 = np.array([8, 9, 10, 11, 12, 13, 14])
    print("The second array is:", arr2)
    arr = arr1 * arr2
    print("The output array is:", arr)

Week 8: Python Modules for Data Science - SciPy and Matplotlib

Scientific Computing with Python - SciPy Module

Scientific computing in Python builds upon a small core of packages. Python itself is a general purpose programming language. It is interpreted and dynamically typed, and is very well suited for interactive work and quick prototyping, while being powerful enough to write large applications in. SciPy is a library of numerical routines for the Python programming language that provides fundamental building blocks for modeling and solving scientific problems.
SciPy includes algorithms for optimization, integration, interpolation, eigenvalue problems, algebraic equations, differential equations and many other classes of problems; it also provides specialized data structures, such as sparse matrices and k-dimensional trees. The scipy package contains various toolboxes dedicated to common issues in scientific computing. Its different submodules correspond to different applications, such as interpolation, integration, optimization, image processing, statistics, special functions, etc.

Key Features and Modules

1. Linear Algebra
The `scipy.linalg` module provides functions for performing linear algebra operations, such as solving linear systems, computing eigenvalues and eigenvectors, and matrix factorizations. These operations are fundamental to various scientific and engineering applications, including data analysis, machine learning, and simulations.

2. Optimization
The `scipy.optimize` module offers a range of optimization algorithms for finding the minimum or maximum of functions. These algorithms are crucial for parameter estimation, model fitting, and solving optimization problems across different fields. From simple gradient-based methods to more advanced global optimization techniques, SciPy has you covered.

3. Signal and Image Processing
With the `scipy.signal` and `scipy.ndimage` modules, you can perform tasks such as signal filtering, convolution, image manipulation, and feature extraction. These tools are vital for processing and analyzing signals, images, and multidimensional data.

4. Statistics
The `scipy.stats` module provides a comprehensive suite of statistical functions for probability distributions, hypothesis testing, descriptive statistics, and more. Researchers and data analysts can leverage these tools to gain insights from data and make informed decisions.

5. Integration and Interpolation
Integration and interpolation are common tasks in scientific computing.
SciPy's `scipy.integrate` module offers methods for numerical integration, while the `scipy.interpolate` module provides interpolation techniques to estimate values between data points.

6. Special Functions
Scientific and mathematical computations often involve special functions like Bessel functions, gamma functions, and hypergeometric functions. The `scipy.special` module offers a collection of these functions, enabling researchers to solve complex mathematical problems.

To begin using SciPy, you'll need to have Python and NumPy installed on your system. Most often, these libraries are bundled together in scientific Python distributions like Anaconda.

Example:

    import numpy as np
    from scipy import linalg

    # Coefficient matrix
    A = np.array([[2, 1], [1, 3]])

    # Right-hand side vector
    b = np.array([5, 8])

    # Solve the linear system
    x = linalg.solve(A, b)
    print("Solution:", x)

In this example, the `linalg.solve` function from SciPy's `linalg` module is used to solve the system of equations represented by the matrix `A` and vector `b`.

scipy.integrate

The scipy.integrate sub-package provides several integration techniques, including an ordinary differential equation integrator. Its quad function integrates a function of one variable between two limits. The first argument to quad is a "callable" Python object (i.e., a function, method, or class instance); a lambda function is often convenient as this argument. The next two arguments are the limits of integration. The return value is a tuple, with the first element holding the estimated value of the integral and the second element holding an upper bound on the error.

scipy.optimize

SciPy optimize provides functions for minimizing (or maximizing) objective functions, possibly subject to constraints. It includes solvers for nonlinear problems (with support for both local and global optimization algorithms), linear programming, constrained and nonlinear least-squares, root finding, and curve fitting.

scipy.interpolate

Interpolation is a method for generating points between given points.
Interpolation has many uses. In machine learning we often deal with missing data in a dataset, and interpolation is often used to substitute those values. This method of filling values is called imputation. Apart from imputation, interpolation is often used where we need to smooth the discrete points in a dataset. SciPy provides a module called scipy.interpolate which has many functions to deal with interpolation.

Example:

    from scipy.interpolate import interp1d
    import numpy as np

    xs = np.arange(10)
    ys = 2*xs + 1

    # build a linear interpolation function from the known points
    interp_func = interp1d(xs, ys)

    # evaluate it at new x values between 2.1 and 2.9
    newarr = interp_func(np.arange(2.1, 3, 0.1))
    print(newarr)

Basic Plotting in Python - Matplotlib

Matplotlib is a Python library for visualizing and analyzing data; its graphical, pictorial visualizations help in better understanding the data. Matplotlib is a comprehensive library for static, animated and interactive visualizations.

Plotting x and y points

The plot() function is used to draw points (markers) in a diagram. By default, the plot() function draws a line from point to point. The function takes parameters for specifying points in the diagram. Parameter 1 is an array containing the points on the x-axis. Parameter 2 is an array containing the points on the y-axis.

Installation of matplotlib library

Step 1: Open the command prompt (just type "cmd" in the Windows Start search bar).
Step 2: Type the below command in the terminal.
    cd Desktop
Step 3: Then type the following command.
    pip install matplotlib

Creating a Simple Plot

    # importing the required module
    import matplotlib.pyplot as plt

    # x axis values
    x = [1, 2, 3]
    # corresponding y axis values
    y = [2, 4, 1]

    # plotting the points
    plt.plot(x, y)

    # naming the x axis
    plt.xlabel('x - axis')
    # naming the y axis
    plt.ylabel('y - axis')

    # giving a title to my graph
    plt.title('My first graph!')

    # function to show the plot
    plt.show()

The code seems self-explanatory.
The following steps were followed:

Define the x-axis and corresponding y-axis values as lists.
Plot them on the canvas using the .plot() function.
Give a name to the x-axis and y-axis using the .xlabel() and .ylabel() functions.
Give a title to your plot using the .title() function.
Finally, to view your plot, use the .show() function.

Some of the basic functions that are often used in matplotlib:

Method – Description
plot() – It creates the plot in the background; it does not display it. We can also pass a label as an argument to plot(), giving the name by which this plot will be called; the label is used by legend().
show() – It displays the created plots.
xlabel() – It labels the x-axis.
ylabel() – It labels the y-axis.
title() – It gives the title to the graph.
gca() – It gives access to all four axes of the graph.
gca().spines['right/left/top/bottom'].set_visible – It accesses the individual spines (the individual boundaries) and changes their visibility (True/False).
xticks() – It decides how the markings are to be made on the x-axis.
yticks() – It decides how the markings are to be made on the y-axis.
gca().legend() – Pass a list of all the plots made as its argument; if labels are not explicitly specified, add the values to the list in the same order as the plots are made.
annotate() – It is used to write comments on the graph at the specified position.
figure(figsize = (x, y)) – Whenever we want the result to be displayed in a separate window we use this command; the figsize argument decides the initial size of the window displayed after the run.
subplot(r, c, i) – It is used to create multiple plots in the same figure; r is the number of rows in the figure, c is the number of columns, and i specifies the position of the particular plot.
set_xticks – It is used to set the range and the step size of the markings on the x-axis in a subplot.
set_yticks – It is used to set the range and the step size of the markings on the y-axis in a subplot.

Line Plot:
Line plots are drawn by joining data points with straight lines, each point lying where its x-axis and y-axis values intersect. Line plots are the simplest form of representing data. In Matplotlib, the plot() function represents this.

Example:

    import matplotlib.pyplot as pyplot

    pyplot.plot([1, 2, 3, 5, 6], [1, 2, 3, 4, 6])
    # axis limits: [xmin, xmax, ymin, ymax]
    pyplot.axis([0, 7, 0, 10])

    # Display the chart
    pyplot.show()

Bar Plot

Bar plots are vertical or horizontal rectangular graphs that show data comparison, where you can gauge the changes over a period represented on another axis (mostly the x-axis). Each bar can store the value of one or multiple data items divided in a ratio. The longer a bar becomes, the greater the value it holds. In Matplotlib, we use the bar() or barh() function to represent it.

Example:

    import matplotlib.pyplot as pyplot

    pyplot.bar([0.25, 2.25, 3.25, 5.25, 7.25], [300, 400, 200, 600, 700],
               label="Carpenter", color='b', width=0.5)
    pyplot.bar([0.75, 1.75, 2.75, 3.75, 4.75], [50, 30, 20, 50, 60],
               label="Plumber", color='g', width=.5)
    pyplot.legend()
    pyplot.xlabel('Days')
    pyplot.ylabel('Wage')
    pyplot.title('Details')

    # Display the chart
    pyplot.show()

Scatter Plot

We can use scatter (previously called XY) plots when comparing data variables to determine the connection between dependent and independent variables. The data is expressed as a collection of points clustered together meaningfully. Here each point has one variable (x) determining the relationship with the other (y).

Example:

    import matplotlib.pyplot as pyplot

    x1 = [1, 2.5, 3, 4.5, 5, 6.5, 7]
    y1 = [1, 2, 3, 2, 1, 3, 4]
    x2 = [8, 8.5, 9, 9.5, 10, 10.5, 11]
    y2 = [3, 3.5, 3.7, 4, 4.5, 5, 5.2]

    pyplot.scatter(x1, y1, label='high bp low heartrate', color='c')
    pyplot.scatter(x2, y2, label='low bp high heartrate', color='g')
    pyplot.title('Smart Band Data Report')
    pyplot.xlabel('x')
    pyplot.ylabel('y')
    pyplot.legend()

    # Display the chart
    pyplot.show()

Pie Plot

A pie plot is a circular graph in which the data is represented by components, segments, or slices of the pie.
Data analysts use them to represent percentage or proportional data, where each pie slice represents an item or data classification. In Matplotlib, the pie() function represents it.

Example:

    import matplotlib.pyplot as pyplot

    # named "slices" to avoid shadowing Python's built-in slice
    slices = [12, 25, 50, 36, 19]
    activities = ['NLP', 'Neural Network', 'Data analytics',
                  'Quantum Computing', 'Machine Learning']
    cols = ['r', 'b', 'c', 'g', 'orange']

    pyplot.pie(slices, labels=activities, colors=cols, startangle=90,
               shadow=True, explode=(0, 0.1, 0, 0, 0), autopct='%1.1f%%')
    pyplot.title('Training Subjects')

    # Display the chart
    pyplot.show()

Area Plot

Area plots spread across certain areas with bumps and drops (highs and lows) and are also known as stack plots. They look similar to line plots and help track the changes over time for two or more related groups that together make up one whole category. In Matplotlib, the stackplot() function represents it.

Example:

    import matplotlib.pyplot as pyplot

    days = [1, 2, 3, 4, 5]
    age = [63, 81, 52, 22, 37]
    weight = [17, 28, 72, 52, 32]

    # empty plots, used only to create the legend entries
    pyplot.plot([], [], color='c', label='Weather Predicted', linewidth=5)
    pyplot.plot([], [], color='g', label='Weather Change happened', linewidth=5)

    pyplot.stackplot(days, age, weight, colors=['c', 'g'])
    # days are plotted on the x-axis
    pyplot.xlabel('Days')
    pyplot.ylabel('Fluctuation with time')
    pyplot.title('Weather report using Area Plot')
    pyplot.legend()

    # Display the chart
    pyplot.show()

Histogram Plot

We can use a histogram plot to show how data is distributed, whereas we can use a bar graph to compare two entities. Histograms and bar plots look alike but are used in different scenarios. In Matplotlib, the hist() function represents this.
Example:

    import matplotlib.pyplot as pyplot

    pop = [22, 55, 62, 45, 21, 22, 34, 42, 42, 4, 2, 8]
    bins = [1, 10, 20, 30, 40, 50]

    pyplot.hist(pop, bins, rwidth=0.6)
    pyplot.xlabel('age groups')
    pyplot.ylabel('Number of people')
    pyplot.title('Histogram')

    # Display the chart
    pyplot.show()

Advanced Plotting in Python - Matplotlib

Example:

    import matplotlib.pyplot as plt

    # Defining the x and y ranges; each x range is (start, width)
    # and the y range is (bottom, height)
    xranges = [(5, 5), (20, 5), (20, 7)]
    yrange = (2, 1)

    # Plotting the broken bar chart
    plt.broken_barh(xranges, yrange, facecolors='green')

    xranges = [(6, 2), (17, 5), (50, 2)]
    yrange = (15, 1)
    plt.broken_barh(xranges, yrange, facecolors='orange')

    xranges = [(5, 2), (28, 5), (40, 2)]
    yrange = (30, 1)
    plt.broken_barh(xranges, yrange, facecolors='red')

    plt.xlabel('Sales')
    plt.ylabel('Days of the Month')
    plt.show()

Example:

    import numpy as np
    import matplotlib.image as image
    import matplotlib.pyplot as plt
    import pandas as pd

    df = pd.read_csv('income.csv')
    im = image.imread('Lebron_James.jpeg')  # Image
    lebron_james = df[df['Name']=='LeBron James']

Interpolation of Data in python

Interpolation is a technique used to estimate unknown data points between two known data points. In Python, interpolation is mostly used to impute missing values in a data frame or series while preprocessing data. It is especially common with time-series data, where we like to fill a missing value from the one or two values before it. For example, with temperature data we would prefer to fill in today's temperature with the mean of the last two days, not with the mean of the month. We can also use interpolation for calculating moving averages.

A pandas Series is a one-dimensional array that, like a list, is capable of storing elements of various data types. We can easily create a series with the help of a list, tuple, or dictionary.
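As a small sketch of the three ways of creating a Series mentioned above (the values and labels are made up for illustration):

```python
import pandas as pd

# From a list and from a tuple: the index defaults to 0, 1, 2, ...
from_list = pd.Series([10, 20, 30])
from_tuple = pd.Series((10, 20, 30))

# From a dictionary: the keys become the index labels.
from_dict = pd.Series({"a": 10, "b": 20, "c": 30})

print(from_list.tolist())
print(list(from_dict.index))
print(from_dict["b"])
```

The list and tuple forms produce identical Series; the dictionary form is useful when each value needs a meaningful label.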
To demonstrate the interpolation methods, we will create a pandas Series with some NaN values and fill the missing values with interpolated values, using the interpolate method or other interpolation approaches.

Example:

    import pandas as pd
    import numpy as np

    a = pd.Series([0, 1, np.nan, 3, 4, 5, 7])

Linear Interpolation

Linear interpolation estimates a missing value by connecting the neighbouring known points with a straight line. In short, it estimates the unknown value in the same increasing order as the previous values. The default method used by interpolate is linear.

Polynomial Interpolation

In polynomial interpolation, you need to specify an order. Polynomial interpolation fills missing values using a polynomial of the lowest possible degree that passes through the available data points. The resulting curve resembles a trigonometric sine curve or a parabola.

WEEK-9 Python Modules for Data Science - SciPy and Matplotlib Part II

Optimization of Data in Python

Python also allocates extra information to store strings, which causes them to take up more space. To increase efficiency, an optimization method called string interning is used. The idea behind string interning is to cache certain strings in memory as they are created.

From a mathematical foundation viewpoint, the three pillars of data science that we need to understand well are linear algebra, statistics, and optimization, which is used in almost all data science algorithms. To understand optimization concepts, one needs a good fundamental understanding of linear algebra.

Optimization is a problem in which you maximize or minimize a real function by systematically choosing input values from an allowed set and computing the value of the function. That means that when we talk about optimization we are always interested in finding the best solution.
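The definition above can be made concrete with a small sketch. It assumes SciPy's `minimize_scalar` with the bounded method; the functions `f` and `g` and the interval from -2 to 2 are hypothetical choices made only for illustration.

```python
from scipy.optimize import minimize_scalar


def f(x):
    # Hypothetical function to minimize; its smallest value is at x = 1.
    return (x - 1.0) ** 2 + 3.0


# Search for the best x within the allowed set -2 <= x <= 2.
result = minimize_scalar(f, bounds=(-2.0, 2.0), method="bounded")
print(result.x, result.fun)


def g(x):
    # Hypothetical function to maximize; its largest value is at x = -0.5.
    return -(x + 0.5) ** 2


# Maximizing g is the same as minimizing -g.
result_max = minimize_scalar(lambda x: -g(x), bounds=(-2.0, 2.0), method="bounded")
print(result_max.x, g(result_max.x))
```

The solver returns an approximate optimum, so the printed values are close to, rather than exactly equal to, the true minimizer and maximizer.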
So, let say that one has some functional form(e.g in the form of f(x)) that he is interested in and he is trying to find the best solution for this functional form. One could either say he is interested in minimizing this functional form or maximizing this functional form. Almost all machine learning algorithms can be viewed as solutions to optimization problems and it is interesting that even in cases, where the original machine learning technique has a basis derived from other fields for example, from biology and so on one could still interpret all of these machine learning algorithms as some solution to an optimization problem. A basic understanding of optimization will help in: o More deeply understand the working of machine learning algorithms. o Rationalize the working of the algorithm. That means if you get a result and you want to interpret it, and if you had a very deep understanding of optimization you will be able to see why you got the result. o And at an even higher level of understanding, you might be able to develop new algorithms yourselves. Components of an Optimization Problem Generally, an optimization problem has three components. minimize f(x), w.r.t x, subject to a ≤ x ≤ b 1. The objective function(f(x)): The first component is an objective function f(x) which we are trying to either maximize or minimize. In general, we talk about minimization problems this is simply because if you have a maximization problem with f(x) we can convert it to a minimization problem with -f(x). So, without loss of generality, we can look at minimization problems. 2. Decision variables(x): The second component is the decision variables which we can choose to minimize the function. So, we write this as min f(x). 3. Constraints(a ≤ x ≤ b): The third component is the constraint which basically constrains this x to some set. Types of Optimization Problems: Depending on the types of constraints only: 1. 
Constrained optimization problems: Where constraints are given and the solution must satisfy them, we call the problem a constrained optimization problem.
2. Unconstrained optimization problems: Where there are no constraints, we call the problem an unconstrained optimization problem.

Depending on the types of objective functions, decision variables and constraints:

If the decision variable (x) is a continuous variable: A variable x is said to be continuous if it can take any value within an interval, and so an infinite number of values. In the case below, x can take an infinite number of values between -2 and 2.

min f(x), x ∈ (-2, 2)

Linear programming problem: If the decision variable (x) is continuous, the objective function (f) is linear, and all the constraints are also linear, then the problem is known as a linear programming problem.

Nonlinear programming problem: If the decision variable (x) remains continuous but either the objective function (f) or the constraints are non-linear, then the problem is known as a non-linear programming problem. So, a programming problem becomes non-linear if either the objective or the constraints become non-linear.

If the decision variable (x) is an integer variable: All numbers whose fractional part is 0 (zero), like -3, -2, 1, 0, 10, 100, are integers.

min f(x), x ∈ {0, 1, 2, 3}

Linear integer programming problem: If the decision variable (x) is an integer variable, the objective function (f) is linear, and all the constraints are also linear, then the problem is known as a linear integer programming problem.
Nonlinear integer programming problem: If the decision variable (x) remains an integer but either the objective function (f) or the constraints are non-linear, then the problem is known as a non-linear integer programming problem.

Binary integer programming problem: If the decision variable (x) can take only the binary values 0 and 1, then the problem is known as a binary integer programming problem.

min f(x), x ∈ {0, 1}

If the decision variable (x) is a mixed variable: If the problem combines continuous and integer variables, the decision variable is known as a mixed variable.

min f(x1, x2), x1 ∈ {0, 1, 2, 3} and x2 ∈ (-2, 2)

Mixed-integer linear programming problem: If the decision variable (x) is a mixed variable, the objective function (f) is linear, and all the constraints are also linear, then the problem is known as a mixed-integer linear programming problem.

Mixed-integer non-linear programming problem: If the decision variable (x) remains mixed but either the objective function (f) or the constraints are non-linear, then the problem is known as a mixed-integer non-linear programming problem.

Linear algebra operations - scipy.linalg (SciPy Linalg Package) in Python
The scipy.linalg module provides standard linear algebra operations, relying on an underlying efficient implementation (BLAS, LAPACK).
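A minimal sketch of typical scipy.linalg calls follows: determinant, inverse, the error raised for a singular matrix, and a singular-value decomposition. The matrices are illustrative values chosen here, not taken from the course material.

```python
import numpy as np
from scipy import linalg

arr = np.array([[1.0, 2.0],
                [3.0, 4.0]])

det = linalg.det(arr)   # determinant of a square matrix
inv = linalg.inv(arr)   # inverse of a square matrix
identity_check = np.allclose(arr @ inv, np.eye(2))

# A singular matrix (second row is twice the first) has no inverse
singular = np.array([[3.0, 2.0],
                     [6.0, 4.0]])
try:
    linalg.inv(singular)
    singular_raised = False
except linalg.LinAlgError:
    singular_raised = True

# Singular-value decomposition: mat = u @ diag(spectrum) @ vh
mat = np.arange(9, dtype=float).reshape(3, 3) + np.diag([1.0, 0.0, 1.0])
u, spectrum, vh = linalg.svd(mat)
rebuilt = u @ np.diag(spectrum) @ vh

print(det, identity_check, singular_raised, np.allclose(rebuilt, mat))
```

Here spectrum holds the singular values as a 1-D array, so it has to be expanded with np.diag before the three factors can be multiplied back together.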
The scipy.linalg.det() function computes the determinant of a square matrix, and the scipy.linalg.inv() function computes the inverse of a square matrix. Computing the inverse of a singular matrix (one whose determinant is zero) raises LinAlgError.

More advanced operations are also available, for example singular-value decomposition (SVD) with scipy.linalg.svd(), which returns the singular values (the spectrum) along with two unitary matrices. The original matrix can be re-composed by matrix multiplication of the outputs of svd with np.dot. SVD is commonly used in statistics and signal processing. Many other standard decompositions (QR, LU, Cholesky, Schur), as well as solvers for linear systems, are available in scipy.linalg.

SciPy Stats Package in Python
The module scipy.stats contains statistical tools and probabilistic descriptions of random processes. Random number generators for various random processes can be found in numpy.random. Given observations of a random process, their histogram is an estimator of the random process's PDF (probability density function).

Mean, median and percentiles
The mean is an estimator of the center of the distribution:

    >>> np.mean(samples)
    -0.0452567074...

The median is another estimator of the center. It is the value with half of the observations below, and half above:

    >>> np.median(samples)
    -0.0580280347...

The median is also the 50th percentile, because 50% of the observations are below it:

    >>> stats.scoreatpercentile(samples, 50)
    -0.0580280347...
    >>> stats.scoreatpercentile(samples, 90)
    1.2315935511...

std
std computes the standard deviation of an array. An optional second argument provides the axis to use (default is to use the entire array). std can be used either as a function or as a method on an array.

var
var computes the variance of an array. An optional second argument provides the axis to use (default is to use the entire array). var can be used either as a function or as a method on an array.

corrcoef
corrcoef(x) computes the correlation between the rows of a 2-dimensional array x.
corrcoef(x, y) computes the correlation between two 1-dimensional vectors. An optional keyword argument rowvar can be used to compute the correlation between the columns of the input: corrcoef(x, rowvar=False) and corrcoef(x.T) are identical.

Statistical Tests
A statistical test is a decision indicator. For instance, if we have two sets of observations that we assume are generated from Gaussian processes, we can use a T-test to decide whether the means of the two sets of observations are significantly different:

    >>> a = np.random.normal(0, 1, size=100)
    >>> b = np.random.normal(1, 1, size=10)
    >>> stats.ttest_ind(a, b)
    (array(-3.177574054...), 0.0019370639...)

The output contains two values.
The T statistic value: a number whose sign indicates the direction of the difference between the two sample means and whose magnitude is related to the significance of this difference.
The p value: the probability of observing a difference at least this extreme if both processes had the same mean. If it is close to 1, the data give no evidence of a difference. The closer it is to zero, the stronger the evidence that the processes have different means.

Python is a general-purpose language with statistics modules. R has more statistical analysis features than Python, and specialized syntaxes. However, when it comes to building complex analysis pipelines that mix statistics with e.g. image analysis, text mining, or control of a physical experiment, the richness of Python is an invaluable asset.

Data representation and interaction
The setting that we consider for statistical analysis is that of multiple observations or samples described by a set of different attributes or features. The data can then be seen as a 2D table, or matrix, with columns giving the different attributes of the data, and rows the observations.
For instance, the data contained in examples/brain_size.csv:

    "";"Gender";"FSIQ";"VIQ";"PIQ";"Weight";"Height";"MRI_Count"
    "1";"Female";133;132;124;"118";"64.5";816932
    "2";"Male";140;150;124;".";"72.5";1001121
    "3";"Male";139;123;150;"143";"73.3";1038437
    "4";"Male";133;129;128;"172";"68.8";965353
    "5";"Female";137;132;134;"147";"65.0";951545

Hypothesis testing: comparing two groups (Scipy lecture notes, Edition 2022.1, section 16.2)
For simple statistical tests, we will use the scipy.stats sub-module of scipy:

    >>> from scipy import stats

i) Student's t-test: the simplest statistical test

1-sample t-test: testing the value of a population mean

2-sample t-test: testing for difference across populations
To test whether the difference in VIQ between the female and male populations is significant, we do a 2-sample t-test with scipy.stats.ttest_ind(). In the examples below, data is assumed to be the table loaded from the CSV file above.

    >>> female_viq = data[data['Gender'] == 'Female']['VIQ']
    >>> male_viq = data[data['Gender'] == 'Male']['VIQ']
    >>> stats.ttest_ind(female_viq, male_viq)
    Ttest_indResult(statistic=-0.77261617232..., pvalue=0.4445287677858...)

ii) Paired tests: repeated measurements on the same individuals
PIQ, VIQ, and FSIQ give 3 measures of IQ. Let us test if FSIQ and PIQ are significantly different. We could use a 2-sample test, but the problem with this approach is that it forgets that there are links between observations: FSIQ and PIQ are measured on the same individuals. Thus the variance due to inter-subject variability is confounding, and can be removed using a "paired test", or "repeated measures test":

    >>> stats.ttest_rel(data['FSIQ'], data['PIQ'])
    Ttest_relResult(statistic=1.784201940..., pvalue=0.082172638183...)

This is equivalent to a 1-sample test on the difference. T-tests assume Gaussian errors.
We can use a Wilcoxon signed-rank test, which relaxes this assumption.

Statistical models in Python
i) Simple linear regression: Given two sets of observations, x and y, we want to test the hypothesis that y is a linear function of x. In other terms:

y = x * coef + intercept + e

where e is observation noise. We will use the statsmodels module to:
1. Fit a linear model. We will use the simplest strategy, ordinary least squares (OLS).
2. Test that coef is non-zero.

First, we generate simulated data according to the model:

    >>> import numpy as np
    >>> import pandas
    >>> x = np.linspace(-5, 5, 20)
    >>> np.random.seed(1)
    >>> # normal distributed noise
    >>> y = -5 + 3*x + 4 * np.random.normal(size=x.shape)
    >>> # Create a data frame containing all the relevant variables
    >>> data = pandas.DataFrame({'x': x, 'y': y})

Then we specify an OLS model and fit it:

    >>> from statsmodels.formula.api import ols
    >>> model = ols("y ~ x", data).fit()

ii) Multiple Regression: including multiple factors
Consider a linear model explaining a variable z (the dependent variable) with 2 variables x and y:

z = x * c1 + y * c2 + i + e

Such a model can be seen in 3D as fitting a plane to a cloud of (x, y, z) points.
Example: the iris data (examples/iris.csv)

iii) Post-hoc hypothesis testing: analysis of variance (ANOVA)
In the above iris example, we wish to test if the petal length is different between versicolor and virginica, after removing the effect of sepal width. This can be formulated as testing the difference between the coefficients associated with versicolor and virginica in the linear model estimated above (it is an Analysis of Variance, ANOVA). For this, we write a vector of 'contrast' on the parameters estimated: we want to test "name[T.versicolor] - name[T.virginica]" with an F-test.

Select Statistics Functions
i) Mode: mode computes the mode of an array. An optional second argument provides the axis to use (default is to use the entire array).
Returns two outputs: the first contains the values of the mode, the second contains the number of occurrences.

    >>> x = randint(1, 11, 1000)
    >>> stats.mode(x)
    (array([ 4.]), array([ 112.]))

ii) moment: moment computes the rth central moment of an array. An optional second argument provides the axis to use (default is to use the entire array).
iii) skew: skew computes the skewness of an array. An optional second argument provides the axis to use (default is to use the entire array).

Week-10 Python Modules for Data Science

Data Processing and Analysis
Data to be processed arrives in various formats, including CSV, XML, HTML, SQL, and JSON, and each format requires its own processing approach. Among the many programming languages available, Python is frequently recommended for machine learning applications because of its major libraries and up-to-date tooling. Machine learning is built on data processing, and model success is highly dependent on the ability to read and transform data into the format required for the task at hand.

Most large datasets are available in a tabular format, with rows referring to records and columns corresponding to features. Pandas handles this type of data well. It has evolved into a full-featured library that can handle both series and tabular data. For text data, many natural language processing techniques, such as tokenization and lemmatization, can be done using NLTK, and spaCy is a good choice for advanced natural language processing and optimised pipelines.

Python, in particular, is a highly regarded data processing language for a variety of reasons, including the following:
Prototyping and experimenting with code is incredibly simple. Processing data, especially from less-than-clean sources, necessitates a great deal of tweaking, back and forth, and a struggle to capture all options.
Python 3 significantly improved multi-language support by making every string Unicode, which enables the processing of data written in different languages and stored in different character encodings.
The standard library is quite strong and packed with essential modules that provide native support for common file types such as CSV files, zip files, and databases.
The Python third-party library ecosystem is enormous, with a wealth of excellent modules that extend the capabilities of a programme. There are also modules for geospatial data analysis, creating command-line interfaces, graphical interfaces, parsing data, and everything in between.
Jupyter Notebooks allow you to execute code and receive immediate feedback.
Python is quite agnostic about the development environment required, allowing it to function with anything from a simple text editor to more complex alternatives such as Visual Studio.

Pandas Series Data Structure
A Series is a one-dimensional array of data. It can hold data of any type: string, integer, float, dictionaries, lists, booleans, and more. The easiest way to conceptualize a Series is as a single column in a table, which is why it is considered one-dimensional.

A Pandas Series can be used to evaluate financial data such as stock prices, currency exchange rates, and commodities prices. You may, for example, use the Series object to save daily stock values and generate statistics like mean, median, and standard deviation. To visualize the data, you may also use pandas' built-in charting methods.

A Pandas Series may also be used to analyze time series data, such as temperature or rainfall.
The Series object may be used to store data and execute operations such as resampling, interpolation, and rolling window computations. Typical code for this reads temperature data from a CSV file, converts it to a Pandas Series, and then performs operations on it such as resampling, rolling window computations, and interpolation. It can also compute the data's autocorrelation.

A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.).

Syntax: pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
data: array-like. Contains data stored in Series.
index: array-like or Index (1d)
dtype: str, numpy.dtype, or ExtensionDtype, optional
name: str, optional
copy: bool, default False

Example:

    import pandas as pd

    # a simple char list (named chars to avoid shadowing the built-in list)
    chars = ['g', 'e', 'e', 'k', 's']

    # create series from a char list
    res = pd.Series(chars)
    print(res)

Example:

    import pandas as pd

    # a simple int list
    numbers = [1, 2, 3, 4, 5]

    # create series from an int list
    res = pd.Series(numbers)
    print(res)

Creating a Series:
We can create a Series in two ways:
1. Create an empty Series
2. Create a Series using inputs.

Create an Empty Series:
We can easily create an empty Series in Pandas, which means it will not have any values. The syntax used for creating an empty Series:

    series_object = pandas.Series()

The example below creates an empty Series object that has no values. The default dtype of an empty Series was float64 in older pandas releases and is object in pandas 2.x, so the dtype is passed explicitly here.

    import pandas as pd

    x = pd.Series(dtype='float64')
    print(x)

Creating a Series using inputs:
We can create a Series by using various inputs:
Array
Dict
Scalar value

Creating Series from Array:
Before creating a Series, we first have to import the numpy module and then use the array() function in the program. If the data is an ndarray, then the passed index must be of the same length.
If we do not pass an index, then by default an index of range(n) is used, where n is the length of the array, i.e., [0, 1, 2, ..., len(array)-1].

Example

    import pandas as pd
    import numpy as np

    info = np.array(['P', 'a', 'n', 'd', 'a', 's'])
    a = pd.Series(info)
    print(a)

Create a Series from dict
We can also create a Series from a dict. If a dictionary is passed as input and the index is not specified, the dictionary keys are used to construct the index (current pandas versions keep the keys in insertion order; older versions sorted them). If an index is passed, the values corresponding to the labels in the index are extracted from the dictionary.

    # import the pandas library
    import pandas as pd
    import numpy as np

    info = {'x': 0., 'y': 1., 'z': 2.}
    a = pd.Series(info)
    print(a)

Create a Series using Scalar:
If the data is a scalar value, then an index must be provided. The scalar value is repeated to match the length of the index.

    # import pandas library
    import pandas as pd
    import numpy as np

    x = pd.Series(4, index=[0, 1, 2, 3])
    print(x)

Python Dataframe
A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Data is aligned in a tabular fashion in rows and columns, like a spreadsheet, a SQL table, or a dict of Series objects. A Pandas DataFrame consists of three principal components: the data, the rows, and the columns.

Creating Python DataFrame
In the real world, a Pandas DataFrame is usually created by loading a dataset from existing storage, such as a SQL database, a CSV file, or an Excel file. A DataFrame can also be created from lists, a dictionary, a list of dictionaries, etc. Here are some of the ways to create a DataFrame:

Example 1: DataFrame can be created using a single list or a list of lists.
    # import pandas as pd
    import pandas as pd

    # list of strings
    lst = ['Geeks', 'For', 'Geeks', 'is', 'portal', 'for', 'Geeks']

    # Calling DataFrame constructor on list
    df = pd.DataFrame(lst)
    print(df)

Example 2: Creating DataFrame from dict of ndarray/lists.
To create a DataFrame from a dict of ndarrays or lists, all the arrays must be of the same length. If an index is passed, then the length of the index should be equal to the length of the arrays. If no index is passed, then by default the index will be range(n), where n is the array length.

    # Python code demonstrating creating a
    # DataFrame from a dict of ndarrays / lists
    import pandas as pd

    # initialise data of lists.
    data = {'Name': ['Tom', 'nick', 'krish', 'jack'],
            'Age': [20, 21, 19, 18]}

    # Create DataFrame
    df = pd.DataFrame(data)

    # Print the output.
    print(df)

Dealing with a column and row in DataFrame
Selection of column: In order to select columns in a Pandas DataFrame, we can access them by their column names.

    # Import pandas package
    import pandas as pd

    # Define a dictionary containing employee data
    data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
            'Age': [27, 24, 22, 32],
            'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
            'Qualification': ['Msc', 'MA', 'MCA', 'Phd']}

    # Convert the dictionary into DataFrame
    df = pd.DataFrame(data)

    # select two columns
    print(df[['Name', 'Qualification']])

Select Rows and Column from Pandas DataFrame
Example 1: Selecting rows.
pandas.DataFrame.loc is an indexer used to select rows from a Pandas DataFrame based on the condition provided.
Syntax: df.loc[df['cname'] condition]
Parameters:
df: represents the data frame
cname: represents the column name
condition: represents the condition on which rows are selected

Example:

    # Importing DataFrame from pandas
    from pandas import DataFrame

    # Creating a data frame
    Data = {'Name': ['Mohe', 'Shyni', 'Parul', 'Sam'],
            'ID': [12, 43, 54, 32],
            'Place': ['Delhi', 'Kochi', 'Pune', 'Patna']}
    df = DataFrame(Data, columns=['Name', 'ID', 'Place'])

    # Print original data frame
    print("Original data frame:\n")
    print(df)

    # Selecting the rows where Name is 'Mohe'
    select_prod = df.loc[df['Name'] == 'Mohe']
    print("\n")

    # Print selected rows based on the condition
    print("Selecting rows:\n")
    print(select_prod)

Example 2: Selecting column.

    # Importing DataFrame from pandas
    from pandas import DataFrame

    # Creating a data frame
    Data = {'Name': ['Mohe', 'Shyni', 'Parul', 'Sam'],
            'ID': [12, 43, 54, 32],
            'Place': ['Delhi', 'Kochi', 'Pune', 'Patna']}
    df = DataFrame(Data, columns=['Name', 'ID', 'Place'])

    # Print original data frame
    print("Original data frame:")
    print(df)

    print("Selected column: ")
    print(df[['Name', 'ID']])

Pandas DataFrame reindex( ) Method
The reindex() method allows you to change the row indexes and the column labels.

Syntax: dataframe.reindex(keys, method, copy, level, fill_value, limit, tolerance)
Parameters:
keys: Required. String or list containing row indexes or column labels.
method: None, 'backfill', 'bfill', 'pad', 'ffill' or 'nearest'. Optional, default None. Specifies the method to use when filling holes in the indexes. For increasing/decreasing indexes only.
copy: True or False. Optional, default True. Whether to return a new object (a copy) when all the new indexes are the same as the old.
level: Number or Label. Optional.
fill_value: List of values. Optional, default NaN. Specifies the value to use for missing values.
limit: Number. Optional, default None.
tolerance: Optional.

Example:

    import pandas as pd

    data = {
        "age": [50, 40, 30, 40],
        "qualified": [True, False, False, False]
    }
    idx = ["Sally", "Mary", "John", "Monica"]
    df = pd.DataFrame(data, index=idx)

    # None of the new labels exist in df, so every row is filled with NaN
    newidx = ["Robert", "Cindy", "Chloe", "Pete"]
    newdf = df.reindex(newidx)
    print(newdf)

Reindexing in Pandas can be used to change the index of the rows and columns of a DataFrame. An index is the set of labels associated with a pandas Series or DataFrame. Let's see how we can reindex the columns and rows in a Pandas DataFrame.

Reindexing is used to change the row labels and column labels of a DataFrame. It means to conform the data to match a given set of labels along a particular axis. It lets us perform multiple operations through indexing, such as:
To insert missing value (NaN) markers in label locations where no data for the label existed before.
To reorder the existing data to match a new set of labels.

Example

    import pandas as pd
    import numpy as np

    N = 20
    data = pd.DataFrame({
        'A': pd.date_range(start='2016-01-01', periods=N, freq='D'),
        'x': np.linspace(0, stop=N-1, num=N),
        'y': np.random.rand(N),
        'C': np.random.choice(['Low', 'Medium', 'High'], N).tolist(),
        'D': np.random.normal(100, 10, size=(N)).tolist()})

    # reindexing the DataFrame; column 'B' does not exist, so it is filled with NaN
    data_reindexed = data.reindex(index=[0, 2, 5], columns=['A', 'C', 'B'])
    print(data_reindexed)

Reindexing the Rows
One can reindex a single row or multiple rows by using the reindex() method. By default, labels in the new index that are not present in the DataFrame are assigned NaN.
Example: # import numpy and pandas module import pandas as pd import numpy as np column=['a','b','c','d','e'] index=['A','B','C','D','E'] # create a dataframe of random values of array df1 = pd.DataFrame(np.random.rand(5,5), columns=column, index=index) print(df1) print('\n\nDataframe after reindexing rows: \n', df1.reindex(['B', 'D', 'A', 'C', 'E'])) Example: # import numpy and pandas module import pandas as pd import numpy as np column = ['a', 'b', 'c', 'd', 'e'] index = ['A', 'B', 'C', 'D', 'E'] # create a dataframe of random values of array df1 = pd.DataFrame(np.random.rand(5, 5), columns = column, index = index) # create the new index for rows new_index =['U', 'A', 'B', 'C', 'Z'] print(df1.reindex(new_index)) Reindexing the columns using the axis keyword One can reindex a single column or multiple columns by using reindex() method and by specifying the axis we want to reindex. Default values in the new index that are not present in the dataframe are assigned NaN. Example: # import numpy and pandas module import pandas as pd import numpy as np column=['a','b','c','d','e'] index=['A','B','C','D','E'] #create a dataframe of random values of array df1 = pd.DataFrame(np.random.rand(5,5), columns=column, index=index) column=['e','a','b','c','d'] # create the new index for columns print(df1.reindex(column, axis='columns')) Example: # import numpy and pandas module import pandas as pd import numpy as np column =['a', 'b', 'c', 'd', 'e'] index =['A', 'B', 'C', 'D', 'E'] # create a dataframe of random values of array df1 = pd.DataFrame(np.random.rand(5, 5), columns = column, index = index) column =['a', 'b', 'c', 'g', 'h'] # create the new index for columns print(df1.reindex(column, axis ='columns')) Replacing the missing values Missing values from the dataframe can be filled by passing a value to the keyword fill_value. This keyword replaces the NaN values. 
Example:

    # import numpy and pandas module
    import pandas as pd
    import numpy as np

    column = ['a', 'b', 'c', 'd', 'e']
    index = ['A', 'B', 'C', 'D', 'E']

    # create a dataframe of random values of array
    df1 = pd.DataFrame(np.random.rand(5, 5), columns=column, index=index)

    # create the new index for columns
    column = ['a', 'b', 'c', 'g', 'h']
    print(df1.reindex(column, axis='columns', fill_value=1.5))

Replacing the missing data with a string:

    # import numpy and pandas module
    import pandas as pd
    import numpy as np

    column = ['a', 'b', 'c', 'd', 'e']
    index = ['A', 'B', 'C', 'D', 'E']

    # create a dataframe of random values of array
    df1 = pd.DataFrame(np.random.rand(5, 5), columns=column, index=index)

    # create the new index for columns
    column = ['a', 'b', 'c', 'g', 'h']
    print(df1.reindex(column, axis='columns', fill_value='data missing'))

Limit on Filling values while Reindexing
The reindex() function also takes a parameter "limit", which sets the maximum number of consecutive missing entries to fill when a fill method is used.

Example:

    import pandas as pd
    import numpy as np

    df1 = pd.DataFrame(np.random.randn(6, 3), columns=['col1', 'col2', 'col3'])
    df2 = pd.DataFrame(np.random.randn(2, 3), columns=['col1', 'col2', 'col3'])

    # Reindexing df2 like df1 pads the extra rows with NaN
    print(df2.reindex_like(df1))

    # Now fill the NaNs with preceding values, at most 1 consecutive row
    print("Data Frame with Forward Fill limiting to 1:")
    print(df2.reindex_like(df1, method='ffill', limit=1))

Python Pandas - Function Application
To apply your own or another library's functions to Pandas objects, you should be aware of three important methods, discussed below. The appropriate method to use depends on whether your function expects to operate on an entire DataFrame, row- or column-wise, or element-wise.
Table wise Function Application: pipe()
Row or Column Wise Function Application: apply()
Element wise Function Application: applymap()

Table-wise Function Application
Custom operations can be performed by passing the function and the appropriate number of parameters as pipe arguments. The operation is then performed on the whole DataFrame. For example, to add the value 2 to all the elements in the DataFrame:

Adder function
The adder function takes two numeric values as parameters and returns their sum.

    def adder(ele1, ele2):
        return ele1 + ele2

Use the custom function to conduct the operation on the DataFrame.

    df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])
    df.pipe(adder, 2)

Example:

    import pandas as pd
    import numpy as np

    def adder(ele1, ele2):
        return ele1 + ele2

    df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])

    # pipe() passes df as the first argument, so this prints df + 2
    print(df.pipe(adder, 2))

Row or Column Wise Function Application
Arbitrary functions can be applied along the axes of a DataFrame using the apply() method, which, like the descriptive statistics methods, takes an optional axis argument. By default, the operation is performed column-wise, taking each column as an array-like.

Example:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])

    # mean of each column
    print(df.apply(np.mean))

By passing the axis parameter, operations can be performed row-wise.

Example 2

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])

    # mean of each row
    print(df.apply(np.mean, axis=1))

Element Wise Function Application
Not all functions can be vectorized (that is, written to take a whole NumPy array and return another array or a value). For such functions, the methods applymap() on DataFrame and analogously map() on Series accept any Python function taking a single value and returning a single value. In pandas 2.1 and later, DataFrame.applymap() is deprecated in favor of DataFrame.map().
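The three styles of function application can be compared side by side on a small fixed DataFrame. This is a minimal sketch: the data values are illustrative, and the hasattr check is an assumption made here so the element-wise call works both on pandas releases that provide DataFrame.map() and on older ones that only have applymap().

```python
import pandas as pd

df = pd.DataFrame({"col1": [1.0, 2.0, 3.0],
                   "col2": [10.0, 20.0, 30.0]})


def adder(frame, amount):
    # Table-wise: pipe() hands the whole DataFrame in as the first argument
    return frame + amount


shifted = df.pipe(adder, 2)                        # table-wise
col_means = df.apply(lambda col: col.mean())       # column-wise (default axis)
row_means = df.apply(lambda row: row.mean(), axis=1)  # row-wise

# Element-wise on a Series
scaled_col = df["col1"].map(lambda x: x * 100)

# Element-wise on a DataFrame: use map() where available, else applymap()
elementwise = df.map if hasattr(df, "map") else df.applymap
scaled_df = elementwise(lambda x: x * 100)

print(shifted, col_means, row_means, scaled_col, scaled_df, sep="\n\n")
```

The function passed to pipe() sees the full table, the one passed to apply() sees one column or row at a time, and the one passed to map() sees a single value.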
Example:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])

    # My custom function, applied to each element of col1
    print(df['col1'].map(lambda x: x * 100))

Week-11 Python Pandas
In Python Pandas, the user can perform statistical operations using the following functions. They can be applied to a Series or a DataFrame.
sum(): Return the sum of the values.
count(): Return the count of non-empty values.
max(): Return the maximum of the values.
min(): Return the minimum of the values.
mean(): Return the mean of the values.
median(): Return the median of the values.
std(): Return the standard deviation of the values.
describe(): Return the summary statistics for each column.

Pandas sum() method
The sum() method in Python Pandas is used to return the sum of the values. Consider the following example:

    import pandas as pd

    # Dataset
    data = {
        'Maths': [90, 85, 98, 80, 55, 78],
        'Science': [92, 87, 59, 64, 87, 96],
        'English': [95, 94, 84, 75, 67, 65]
    }

    # DataFrame
    df = pd.DataFrame(data)

    # Display the DataFrame
    print("DataFrame = \n", df)

    # Display the Sum of Marks in each column
    print("\nSum = \n", df.sum())

Pandas count() method
The count() method in Python Pandas is used to return the count of non-empty values. Consider the following example:

    import pandas as pd

    # Dataset
    data = {
        'Maths': [90, 85, 98, None, 55, 78],
        'Science': [92, 87, 59, None, None, 96],
        'English': [95, None, 84, 75, 67, None]
    }

    # DataFrame
    df = pd.DataFrame(data)

    # Display the DataFrame
    print("DataFrame = \n", df)

    # Display the Count of non-empty values in each column
    print("\nCount of non-empty values = \n", df.count())

Pandas max() method
The max() method in Python Pandas is used to return the maximum of the values.
Consider the following example: import pandas as pd # Dataset data = { 'Maths': [90, 85, 98, 80, 55, 78], 'Science': [92, 87, 59, 64, 87, 96], 'English': [95, 94, 84, 75, 67, 65] } # DataFrame df = pd.DataFrame(data) # Display the DataFrame print("DataFrame = \n",df) # Display the Maximum of Marks in each column print("\nMaximum Marks = \n",df.max()) Pandas min() method The min() method in Python Pandas is used to return the minimum of the values. Consider the following example: import pandas as pd # Dataset data = { 'Maths': [90, 85, 98, 80, 55, 78], 'Science': [92, 87, 59, 64, 87, 96], 'English': [95, 94, 84, 75, 67, 65] } # DataFrame df = pd.DataFrame(data) # Display the DataFrame print("DataFrame = \n",df) # Display the Minimum of Marks in each column print("\nMinimum Marks = \n",df.min()) Pandas mean() method The mean() method in Python Pandas is used to return the mean of the values. Consider the following example: import pandas as pd # Dataset data = { 'Maths': [90, 85, 98, 80, 55, 78], 'Science': [92, 87, 59, 64, 87, 96], 'English': [95, 94, 84, 75, 67, 65] } # DataFrame df = pd.DataFrame(data) # Display the DataFrame print("DataFrame = \n",df) # Display the Mean of Marks in each column print("\nMean = \n",df.mean()) Pandas median() method The median() method in Python Pandas is used to return the median of the values. import pandas as pd # Dataset data = { 'Maths': [90, 85, 98, 80, 55, 78], 'Science': [92, 87, 59, 64, 87, 96], 'English': [95, 94, 84, 75, 67, 65] } # DataFrame df = pd.DataFrame(data) # Display the DataFrame print("DataFrame = \n",df) # Display the Median of Marks in each column print("\nMedian = \n",df.median()) Pandas std() method The std() method in Python Pandas is used to return the standard deviation of the values. 
    import pandas as pd

    # Dataset
    data = {
        'Maths': [90, 85, 98, 80, 55, 78],
        'Science': [92, 87, 59, 64, 87, 96],
        'English': [95, 94, 84, 75, 67, 65]
    }

    # DataFrame
    df = pd.DataFrame(data)

    # Display the DataFrame
    print("DataFrame = \n", df)

    # Display the Standard Deviation of Marks in each column
    print("\nStandard Deviation = \n", df.std())

Pandas describe() method

The describe() method in Python Pandas is used to return the summary statistics for each column.

    import pandas as pd

    # Dataset
    data = {
        'Maths': [90, 85, 98, None, 55, 78],
        'Science': [92, 87, 59, None, None, 96],
        'English': [95, None, 84, 75, 67, None]
    }

    # DataFrame
    df = pd.DataFrame(data)

    # Display the DataFrame
    print("DataFrame = \n", df)

    # Display the summary using the describe() method
    print("\nSummary of Statistics = \n", df.describe())

Python Pandas - Handling Missing Data

Missing data can occur when no information is provided for one or more items or for a whole unit. Missing data is a very big problem in real-life scenarios. It is also referred to as NA (Not Available) values in pandas. Many datasets simply arrive with missing data, either because the data exists and was not collected or because it never existed. For example, some users being surveyed may choose not to share their income, and others may choose not to share their address; in this way, values go missing from many datasets.

In Pandas, missing data is represented by two values:

None: None is a Python singleton object that is often used for missing data in Python code.
NaN: NaN (an acronym for Not a Number) is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation.

Pandas treats None and NaN as essentially interchangeable for indicating missing or null values.
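As a minimal sketch of that interchangeability, the snippet below builds a small Series containing both markers. The Series name scores and its values are made up for illustration and are not part of the original material.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one None and one NaN among real scores.
scores = pd.Series([100, None, np.nan, 95])

# In a numeric Series, None is stored as NaN on construction.
print(scores)

# isnull() flags both markers as missing.
missing_mask = scores.isnull()
print(missing_mask.tolist())

# count() skips missing values, so only the two real scores are counted.
print(scores.count())
```

Both entries are reported as missing, which is why the detection and filling functions do not need to distinguish between None and NaN.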
To facilitate this convention, there are several useful functions for detecting, removing, and replacing null values in a Pandas DataFrame:

isnull()
notnull()
dropna()
fillna()
replace()
interpolate()

Checking for missing values using isnull() and notnull()

In order to check for missing values in a Pandas DataFrame, we use the functions isnull() and notnull(). Both functions help in checking whether a value is NaN or not. These functions can also be used on a Pandas Series in order to find null values in the series.

Example:

    # importing pandas as pd
    import pandas as pd

    # importing numpy as np
    import numpy as np

    # dictionary of lists
    data = {'First Score': [100, 90, np.nan, 95],
            'Second Score': [30, 45, 56, np.nan],
            'Third Score': [np.nan, 40, 80, 98]}

    # creating a dataframe from dictionary
    df = pd.DataFrame(data)

    # using isnull() function
    print(df.isnull())

Filling missing values using fillna(), replace() and interpolate()

In order to fill null values in a dataset, we use the fillna(), replace() and interpolate() functions, which replace NaN values with some other value. All of these functions help fill null values in a DataFrame. The interpolate() function also fills NA values in the DataFrame, but it uses various interpolation techniques to compute the missing values rather than hard-coding the value.
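The interpolate() behaviour described above can be sketched as follows. This is an illustrative example only: it reuses the score columns from this section, and the variable name filled is an assumption made for the sketch.

```python
import numpy as np
import pandas as pd

# Same score data used elsewhere in this section.
data = {'First Score': [100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan],
        'Third Score': [np.nan, 40, 80, 98]}
df = pd.DataFrame(data)

# Linear interpolation estimates a gap from its neighbouring values
# within each column, instead of using one hard-coded constant.
filled = df.interpolate(method='linear')
print(filled)

# The gap in 'First Score' lies midway between 90 and 95.
print(filled.loc[2, 'First Score'])
```

With the default settings, a leading NaN such as the first entry of 'Third Score' has no earlier value to interpolate from and therefore stays missing.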
Example:

    # importing pandas as pd
    import pandas as pd

    # importing numpy as np
    import numpy as np

    # dictionary of lists
    data = {'First Score': [100, 90, np.nan, 95],
            'Second Score': [30, 45, 56, np.nan],
            'Third Score': [np.nan, 40, 80, 98]}

    # creating a dataframe from dictionary
    df = pd.DataFrame(data)

    # filling missing values with 0 using fillna()
    print(df.fillna(0))

Example:

    # importing pandas as pd
    import pandas as pd

    # importing numpy as np
    import numpy as np

    # dictionary of lists
    data = {'First Score': [100, 90, np.nan, 95],
            'Second Score': [30, 45, 56, np.nan],
            'Third Score': [np.nan, 40, 80, 98]}

    # creating a dataframe from dictionary
    df = pd.DataFrame(data)

    # filling a missing value with the previous one (forward fill);
    # ffill() replaces the deprecated fillna(method='pad')
    print(df.ffill())

Example:

    # importing pandas as pd
    import pandas as pd

    # importing numpy as np
    import numpy as np

    # dictionary of lists
    data = {'First Score': [100, 90, np.nan, 95],
            'Second Score': [30, 45, 56, np.nan],
            'Third Score': [np.nan, 40, 80, 98]}

    # creating a dataframe from dictionary
    df = pd.DataFrame(data)

    # filling a missing value with the next one (backward fill);
    # bfill() replaces the deprecated fillna(method='bfill')
    print(df.bfill())

Dropping missing values using dropna()

In order to drop null values from a DataFrame, we use the dropna() function. This function drops rows or columns of a dataset that contain null values, in different ways.

Example:

    # importing pandas as pd
    import pandas as pd

    # importing numpy as np
    import numpy as np

    # dictionary of lists
    data = {'First Score': [100, 90, np.nan, 95],
            'Second Score': [30, np.nan, 45, 56],
            'Third Score': [52, 40, 80, 98],
            'Fourth Score': [np.nan, np.nan, np.nan, 65]}

    # creating a dataframe from dictionary
    df = pd.DataFrame(data)
    print(df)

    # dropping every row that contains at least one null value
    print(df.dropna())

Python Pandas - Categorical Data

Categoricals are a pandas data type that corresponds to the categorical variables in statistics. Such variables take on a fixed and limited number of possible values, for example grades, gender, or blood group type. Also, for categorical variables the logical order may differ from the lexical order, e.g.
“one”, “two”, “three”. Sorting such variables uses the logical order.

Example:

    # Python code explaining
    # pandas.Categorical()

    # importing libraries
    import numpy as np
    import pandas as pd

    # Categorical using dtype
    c = pd.Series(["a", "b", "d", "a", "d"], dtype="category")
    print("\nCategorical without pandas.Categorical() : \n", c)

    c1 = pd.Categorical([1, 2, 3, 1, 2, 3])
    print("\n\nc1 : ", c1)

    c2 = pd.Categorical(['e', 'm', 'f', 'i', 'f', 'e', 'h', 'm'])
    print("\nc2 : ", c2)

Categorical variables can take on only a limited, and usually fixed, number of possible values. Besides having a fixed set of values, categorical data might have an order, but numerical operations cannot be performed on it. Categoricals are a Pandas data type.

The categorical data type is useful in the following cases:

A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory.
The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order.
As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).

Categorical Data Methods

Let us now learn about some of the categorical data methods.

series.astype() : This method converts the series data into categorical data.
categoricals.cat : The cat attribute helps us to access the categorical methods.
categoricals.cat.codes : The codes attribute is used to view the integer codes of the values present in the data.
categoricals.cat.categories : The categories attribute is used to view the categories of the data.
categoricals.cat.set_categories : The set_categories method is used to set the categories to a given list, which can add new categories or remove existing ones.
categoricals.cat.remove_unused_categories : The remove_unused_categories method is used to remove the unused categories present in the data.
pandas.get_dummies(categoricals) : The get_dummies function is used to convert the categorical data into dummy (indicator) data.

Categorical Object Creation

Categorical objects can be created in the following ways:

1. Series Creation

A series is nothing but a column present in the Pandas DataFrame (which can be seen as a table). Let us see how we can create categorical data when creating a Series. If we want the series to be in the form of categorical data, we can specify the dtype (data type) as category. Refer to the example provided below for more clarity.

Example:

    # importing the necessary module.
    import pandas as pd

    # creating a series and specifying its data type.
    series = pd.Series(['a', 'b', 'c', 'd', 'e'], dtype="category")
    print(series)

2. DataFrame Creation

Similar to the conversion of one series into a categorical series discussed above, we can convert an entire data frame into categorical data. Refer to the example provided below for more clarity.

Example:

    # importing the necessary module.
    import pandas as pd

    # creating a data frame.
    dataFrame = pd.DataFrame(
        {"I": list("abcd"), "II": list("bcde")}, dtype="category")
    print(dataFrame)
    print(dataFrame.dtypes)

3. Controlling Behavior

In the above two examples of series and data frames, we relied on the default behavior by passing category as the data type. Let us now learn about other ways of defining categories.

1. We can pass an instance of CategoricalDtype in place of category in the series. Refer to the example provided below for more clarity.

Example:

    # importing the necessary module.
    import pandas as pd
    from pandas.api.types import CategoricalDtype

    # creating a series.
    series = pd.Series(["a", "b", "c", "d"])

    # Provide the category data type.
    new_type = CategoricalDtype(categories=["b", "c", "d"], ordered=True)
    print(new_type)

    # Apply the data type; "a" is not a listed category, so it becomes NaN.
    print(series.astype(new_type))

2.
We can pass an instance of CategoricalDtype in place of category in data frames. Refer to the example provided below for more clarity.

Example:

    # importing the necessary module.
    import pandas as pd
    from pandas.api.types import CategoricalDtype

    # creating a data frame.
    dataFrame = pd.DataFrame({"A": list("abcd"), "B": list("bcde")})

    # Provide the category data type.
    new_type = CategoricalDtype(categories=list("abcd"), ordered=True)
    print(new_type)

    # Apply the data type to every column; "e" in column B becomes NaN.
    print(dataFrame.astype(new_type))

4. Regaining Original Data

We can convert categorical data back into the original data using either Series.astype(original-dtype) or np.asarray(categorical).

Example:

    # importing the necessary modules.
    import pandas as pd
    import numpy as np

    # creating a series.
    series = pd.Series(["a", "b", "c", "a"])
    print("Original data:\n", series)

    # Convert to the category data type.
    categorical = series.astype("category")

    # Regain the original data in both ways.
    print(categorical.astype(str))
    print(np.asarray(categorical))

Data Visualization with Pandas

Data Visualization with Pandas is the presentation of data in a graphical format. It helps people understand the significance of data by summarizing and presenting a huge amount of data in a simple and easy-to-understand format, and it helps communicate information clearly and effectively.

Example:

    import numpy as np
    import pandas as pd

    # There are some fake data csv files
    # you can read in as dataframes
    df1 = pd.read_csv('df1', index_col=0)
    df2 = pd.read_csv('df2')

Style Sheets

Matplotlib has style sheets that can be used to make the plots look a little nicer. These style sheets include bmh, fivethirtyeight, ggplot, and more. They basically create a set of style rules that your plots follow. We recommend using them: they give all your plots the same look and make them feel more professional. We can even create our own if we want the company’s plots to all have the same look (it is a bit tedious to create one, though). Here is how to use them.
Before plt.style.use() plots look like this:

Example:

    df1['A'].hist()

Pandas DataFrame Plots

There are several plot types built in to pandas, most of them statistical plots by nature:

df.plot.area
df.plot.barh
df.plot.density
df.plot.hist
df.plot.line
df.plot.scatter
df.plot.bar
df.plot.box
df.plot.hexbin
df.plot.kde
df.plot.pie

Area Plots using Pandas DataFrame

An area chart or area graph displays quantitative data graphically. It is based on the line chart. The area between the axis and the line is commonly emphasized with colors, textures and hatchings.

Bar Plots using Pandas DataFrame

A bar chart or bar graph is a chart or graph that presents categorical data with rectangular bars with heights or lengths proportional to the values that they represent. The bars can be plotted vertically or horizontally. A vertical bar chart is sometimes called a column chart.

Histogram Plot using Pandas DataFrame

A histogram is a plot that lets you discover, and show, the underlying frequency distribution (shape) of a set of continuous data. This allows the inspection of the data for its underlying distribution (e.g., normal distribution), outliers, skewness, etc.

Example:

    df1['A'].plot.hist(bins=50)

Line Plot using Pandas DataFrame

A line plot connects successive data points with line segments, showing how a value changes along the index. It is best to use a line plot when the data is a time series. It is a quick, simple way to organize data.

    # The DataFrame index is used for the x axis by default
    df1.plot.line(y='B', figsize=(12, 3), lw=1)

Scatter Plot using Pandas DataFrame

Scatter plots are used when you want to show the relationship between two variables. Scatter plots are sometimes called correlation plots because they show how two variables are correlated.

Example:

    df1.plot.scatter(x='A', y='B')

Box Plots using Pandas DataFrame

It is a plot in which a rectangle is drawn to represent the second and third quartiles, usually with a vertical line inside to indicate the median value.
The lower and upper quartiles are shown as horizontal lines on either side of the rectangle. A boxplot is a standardized way of displaying the distribution of data based on a five-number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”). It can tell you about your outliers and what their values are. It can also tell you if your data is symmetrical, how tightly your data is grouped, and if and how your data is skewed.

    df2.plot.box()  # Can also pass a by= argument for groupby

Python Pandas - Sorting

In order to sort the data frame in pandas, the function sort_values() is used. Pandas sort_values() can sort the data frame in ascending or descending order.

Pandas DataFrame Sorting in Ascending Order

The code snippet sorts the DataFrame df in ascending order based on the ‘Country’ column. However, it does not store or display the sorted data frame.

    # Sorting by column 'Country'
    df.sort_values(by=['Country'])

Sorting the Pandas DataFrame in Descending order

The DataFrame df will be sorted in descending order based on the “Population” column, with the country having the highest population appearing at the top of the DataFrame.

    # Sorting by column "Population"
    df.sort_values(by=['Population'], ascending=False)

Sort Pandas DataFrame by Putting Missing Values First

Here, we are sorting a DataFrame (df) based on the ‘Population’ column, arranging rows with missing values in ‘Population’ to appear first. The sort_values() method with the na_position='first' argument achieves this, placing rows with missing values at the beginning of the sorted DataFrame.

    # Sorting by column "Population"
    # by putting missing values first
    df.sort_values(by=['Population'], na_position='first')

Week-12 Advanced Concepts and Error Handling in Python

Regression

Regression searches for relationships among variables.
For example, you can observe several employees of some company and try to understand how their salaries depend on their features, such as experience, education level, role, city of employment, and so on. This is a regression problem where data related to each employee represents one observation. The presumption is that the experience, education, role, and city are the independent features, while the salary depends on them.

Similarly, you can try to establish the mathematical dependence of housing prices on area, number of bedrooms, distance to the city center, and so on.

Generally, in regression analysis, you consider some phenomenon of interest and have a number of observations. Each observation has two or more features. Following the assumption that at least one of the features depends on the others, you try to establish a relation among them. In other words, you need to find a function that maps some features or variables to others sufficiently well.

The dependent features are called the dependent variables, outputs, or responses. The independent features are called the independent variables, inputs, regressors, or predictors.

Regression is used in many different fields, including economics, computer science, and the social sciences. Its importance rises every day with the availability of large amounts of data and increased awareness of the practical value of data.

Linear Regression

Linear regression is probably one of the most important and widely used regression techniques. It’s among the simplest regression methods. One of its main advantages is the ease of interpreting results.

When implementing linear regression of some dependent variable 𝑦 on the set of independent variables 𝐱 = (𝑥₁, …, 𝑥ᵣ), where 𝑟 is the number of predictors, you assume a linear relationship between 𝑦 and 𝐱: 𝑦 = 𝛽₀ + 𝛽₁𝑥₁ + ⋯ + 𝛽ᵣ𝑥ᵣ + 𝜀. This equation is the regression equation. 𝛽₀, 𝛽₁, …, 𝛽ᵣ are the regression coefficients, and 𝜀 is the random error.
Linear regression calculates the estimators of the regression coefficients or simply the predicted weights, denoted with 𝑏₀, 𝑏₁, …, 𝑏ᵣ. These estimators define the estimated regression function 𝑓(𝐱) = 𝑏₀ + 𝑏₁𝑥₁ + ⋯ + 𝑏ᵣ𝑥ᵣ. This function should capture the dependencies between the inputs and output sufficiently well.

The estimated or predicted response, 𝑓(𝐱ᵢ), for each observation 𝑖 = 1, …, 𝑛, should be as close as possible to the corresponding actual response 𝑦ᵢ. The differences 𝑦ᵢ - 𝑓(𝐱ᵢ) for all observations 𝑖 = 1, …, 𝑛, are called the residuals. Regression is about determining the best predicted weights, that is, the weights corresponding to the smallest residuals. The usual criterion is the sum of squared residuals, SSR = Σᵢ(𝑦ᵢ - 𝑓(𝐱ᵢ))², which the best weights minimize.

Regression Performance

The variation of actual responses 𝑦ᵢ, 𝑖 = 1, …, 𝑛, occurs partly due to the dependence on the predictors 𝐱ᵢ. However, there’s also an additional inherent variance of the output.

The coefficient of determination, denoted as 𝑅², tells you how much of the variation in 𝑦 can be explained by the dependence on 𝐱, using the particular regression model. A larger 𝑅² indicates a better fit and means that the model can better explain the variation of the output with different inputs.

The value 𝑅² = 1 corresponds to SSR = 0. That’s the perfect fit, since the predicted and actual responses match exactly.

Simple Linear Regression

Simple or single-variate linear regression is the simplest case of linear regression, as it has a single independent variable, 𝐱 = 𝑥. The following figure illustrates simple linear regression:

[Figure: simple linear regression]

Multiple Linear Regression in Python

Multiple or multivariate linear regression is a case of linear regression with two or more independent variables.

If there are just two independent variables, then the estimated regression function is 𝑓(𝑥₁, 𝑥₂) = 𝑏₀ + 𝑏₁𝑥₁ + 𝑏₂𝑥₂. It represents a regression plane in a three-dimensional space.
The goal of regression is to determine the values of the weights 𝑏₀, 𝑏₁, and 𝑏₂ such that this plane is as close as possible to the actual responses, while yielding the minimal SSR.

The case of more than two independent variables is similar, but more general. The estimated regression function is 𝑓(𝑥₁, …, 𝑥ᵣ) = 𝑏₀ + 𝑏₁𝑥₁ + ⋯ + 𝑏ᵣ𝑥ᵣ, and there are 𝑟 + 1 weights to be determined when the number of inputs is 𝑟.

Polynomial Regression in Python

You can regard polynomial regression as a generalized case of linear regression. You assume the polynomial dependence between the output and inputs and, consequently, the polynomial estimated regression function.

In other words, in addition to linear terms like 𝑏₁𝑥₁, your regression function 𝑓 can include nonlinear terms such as 𝑏₂𝑥₁², 𝑏₃𝑥₁³, or even 𝑏₄𝑥₁𝑥₂, 𝑏₅𝑥₁²𝑥₂.

The simplest example of polynomial regression has a single independent variable, and the estimated regression function is a polynomial of degree two: 𝑓(𝑥) = 𝑏₀ + 𝑏₁𝑥 + 𝑏₂𝑥².

Now, remember that you want to calculate 𝑏₀, 𝑏₁, and 𝑏₂ to minimize SSR. These are your unknowns.

Keeping this in mind, compare the previous regression function with the function 𝑓(𝑥₁, 𝑥₂) = 𝑏₀ + 𝑏₁𝑥₁ + 𝑏₂𝑥₂, used for linear regression. They look very similar and are both linear functions of the unknowns 𝑏₀, 𝑏₁, and 𝑏₂. This is why you can solve the polynomial regression problem as a linear problem with the term 𝑥² regarded as an input variable.

In the case of two variables and the polynomial of degree two, the regression function has this form: 𝑓(𝑥₁, 𝑥₂) = 𝑏₀ + 𝑏₁𝑥₁ + 𝑏₂𝑥₂ + 𝑏₃𝑥₁² + 𝑏₄𝑥₁𝑥₂ + 𝑏₅𝑥₂².

The procedure for solving the problem is identical to the previous case. You apply linear regression for five inputs: 𝑥₁, 𝑥₂, 𝑥₁², 𝑥₁𝑥₂, and 𝑥₂². As the result of regression, you get the values of six weights that minimize SSR: 𝑏₀, 𝑏₁, 𝑏₂, 𝑏₃, 𝑏₄, and 𝑏₅.
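The idea of treating 𝑥² as an extra input can be sketched with NumPy's least-squares solver. This is an illustrative sketch, not the course's own implementation: the data is made up and noise-free (generated from 𝑦 = 1 + 2𝑥 + 3𝑥²), and the names X, b, ssr and r2 are assumptions of the example.

```python
import numpy as np

# Hypothetical noise-free data from y = 1 + 2x + 3x^2.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x + 3.0 * x**2

# Design matrix with columns 1, x, x^2: the squared term is
# simply treated as one more input variable.
X = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares: find the weights b that minimize SSR.
b, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
print("weights b0, b1, b2:", b)

# Sum of squared residuals and coefficient of determination.
y_pred = X @ b
ssr = np.sum((y - y_pred) ** 2)
r2 = 1 - ssr / np.sum((y - y.mean()) ** 2)
print("SSR:", ssr, "R^2:", r2)
```

Because the data here contains no noise, the recovered weights match the generating polynomial and SSR is essentially zero; with real data the same steps apply, but SSR would be positive and 𝑅² below 1.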
Underfitting and Overfitting

One very important question that might arise when you’re implementing polynomial regression is related to the choice of the optimal degree of the polynomial regression function. There’s no straightforward rule for doing this. It depends on the case. You should, however, be aware of two problems that might follow the choice of the degree: underfitting and overfitting.

Underfitting occurs when a model can’t accurately capture the dependencies among data, usually as a consequence of its own simplicity. It often yields a low 𝑅² with known data and bad generalization capabilities when applied with new data.

Overfitting happens when a model learns both data dependencies and random fluctuations. In other words, a model learns the existing data too well. Complex models, which have many features or terms, are often prone to overfitting. When applied to known data, such models usually yield high 𝑅². However, they often don’t generalize well and have significantly lower 𝑅² when used with new data.

Errors and Exceptions in Python

An error is an issue in a program that prevents the program from completing its task. In comparison, an exception is a condition that interrupts the normal flow of the program. Exceptions occur at run time, during the execution of a program, whereas syntax errors are detected before the program starts running.

Errors are the problems in a program due to which the program will stop the execution. On the other hand, exceptions are raised when some internal event occurs that changes the normal flow of the program.

Two types of error occur in Python:

1. Syntax errors
2. Logical errors (Exceptions)

1. Syntax errors

When the proper syntax of the language is not followed, a syntax error is thrown.
Example:

    # initialize the amount variable
    amount = 10000

    # check whether you are eligible to
    # purchase Dsa Self Paced or not
    if(amount>2999)
        print("You are eligible to purchase Dsa Self Paced")

It returns a syntax error message because a colon (:) is missing after the if statement. We can fix this by writing the correct syntax.

2. Logical errors (Exceptions)

An error that occurs at run time, after the code has passed the syntax check, is called an exception or logical error. For example, when we divide any number by zero, the ZeroDivisionError exception is raised, and when we import a module that does not exist, ImportError is raised.

Example:

    # initialize the marks variable
    marks = 10000

    # perform division with 0
    a = marks / 0
    print(a)

In the above example, ZeroDivisionError is raised because we are trying to divide a number by 0.

Example:

    if(a