Data Analytics-Week3- Python Overview.pdf
Sheridan College
Data Analytics
Dr. Ameera Al-Karkhi
Lecture 3
Subject Code: ENGI 55612
2024

Python Overview
Install Python together with Anaconda Individual Edition:
▪ Anaconda Python is a good platform for beginners who want to learn Python. Anaconda Python is free and built on open-source software, so anyone can contribute to its development.
▪ How to install Anaconda for Python?
1. Go to the Anaconda website and find the installation page for Anaconda Individual Edition: http://docs.anaconda.com/anaconda/install/
2. Follow the instructions to download the installer for your operating system. Choose the most recent version that is compatible with your computer.
3. Follow the prompts and agree to the terms and conditions. When you are asked if you want to "add Anaconda to my PATH environment variable," make sure that you select "yes." This ensures that Anaconda is added to your system's PATH, which is a list of directories that your operating system searches to find the programs it needs.

Anaconda Navigator
Anaconda Navigator is a desktop graphical user interface that allows you to launch applications and manage conda packages, environments, and channels without using command-line commands.

Spyder
Spyder is a Python development environment with many features for working with Python code, such as a text editor, debugger, profiler, and interactive console.

Jupyter Notebook
Jupyter Notebook is a web-based application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text.

Anaconda Prompt or Terminal
Anaconda Prompt is a command line interface that comes with Anaconda Distribution. Terminal is a command line interface that comes with macOS and Linux.
Source: https://blog.hubspot.com/website/anaconda-python

Package
A package is a collection of modules installed using Conda. A module is a Python file that has a .py extension.
Environment
An environment is a directory that contains all the files needed for a particular application, such as the Python interpreter, packages, and configuration files. You can use Conda to create separate environments for different projects.

How to Use Anaconda for Python
▪ Get started with Anaconda Navigator by launching an application. Then, create and run a simple Python program with Spyder or Jupyter Notebook.
✓ With Spyder:
- To launch Spyder, open Anaconda Navigator and select the Spyder IDE from the list of applications.
- Once Spyder is open, you can write your first Python program in the text editor. For instance, you can input print("Hello World"). Once complete, click File > Save as and name your program. To run your program, click the Run icon or press F5 on your keyboard.
✓ With Jupyter Notebook:
- To launch Jupyter Notebook, open Anaconda Navigator and select Jupyter Notebook from the list of applications.
- Once Jupyter Notebook is open, you can create a new Python notebook by selecting New > Python from the top right corner. A new notebook will be created that you can rename by clicking on its name. In the first cell of the notebook, write some code such as print("Hello Anaconda"). To run your Python code in Jupyter Notebook, select the cell you want to run, then click on the Run icon or press Shift + Enter on your keyboard.

Installing Python without Anaconda
Open the Command Prompt and check if Python is installed by typing:
    $ python --version
    $ python3 --version
If Python is not found, go ahead and install it.
Install Python from: https://www.python.org/
Source: https://www.jcchouinard.com/install-python-on-windows/

Go to your download folder and double-click on the file to start the Python installer. Go through each of the prompts.

Make sure to check "Add Python 3.X to PATH". This step allows the Python executable to be found when typing "python" in the command line.
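The effect of the PATH setting can also be inspected from inside Python, for example from the Spyder console. The following is a minimal sketch using only the standard library; the helper name find_on_path is hypothetical and is not part of the lecture material.

```python
import shutil
import sys


def find_on_path(command):
    """Return the full path the shell would run for `command`, or None if it is not on PATH."""
    return shutil.which(command)


# The interpreter that is running this script, and its version number.
print("Running interpreter:", sys.executable)
print("Python version:", sys.version.split()[0])

# Check which of the usual command names can be found through PATH.
for name in ("python", "python3", "pip", "pip3"):
    location = find_on_path(name)
    print(f"{name:8s} ->", location if location else "not found on PATH")
```

If a command is reported as not found, the directory containing it is probably missing from PATH, which is what the installer checkbox is meant to take care of.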
Now re-check if Python was installed properly:
    $ python3 --version

Check if pip is installed:
    $ pip3 --version
If pip is not installed, you can run (after downloading the get-pip.py script):
    $ python3 get-pip.py

Run Python
In the command prompt, type in "python". This will open the Python interactive window, allowing you to type Python code.

Python for Data Visualization

Main Data Types of Python
Data types represent the ways in which information can be stored in Python. The most important data types of Python include:
▪ Strings; e.g., "hello there"
▪ Floats (numbers that contain a decimal point); e.g., 0.25
▪ Integers (whole numbers without a decimal point); e.g., 100
▪ Booleans; e.g., True/False (which behave like 1/0)
▪ Lists: a collection of data that sits between [ ]; e.g., [1,2,3,4,5]
▪ Tuples: a collection of data that sits between ( ); e.g., (1,2,3,4,5)
▪ Dictionaries: a collection of key-value pairs that sits between { }; e.g., {"a":1, "b":2, "c":3}

Data types and data structures
Python uses several data types to represent data. The most important data types in Python are shown in the following table.
Table: Basic data types in Python.

Data can be organized into more complex data structures known as containers. Common containers are listed in the table below.

Naming Variables in Python
When you have a value that you want to use multiple times, it can be easier to store it as a variable (a nickname). For example, let's say I want to use the value of pi (3.14159) many times in my code. Instead of typing out this value each time, I can assign it to a variable which I'll call "pi":
    pi = 3.14159
▪ In Python, variable names can only contain alphanumeric characters and underscores (A-Z, a-z, 0-9, and _).
▪ Variable names can't start with a number.
▪ Variable names are case-sensitive (e.g. a variable named var is different from a variable named Var, and these are both different from a variable named VAR).
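The naming rules above can be checked directly in Python. This is a small sketch rather than part of the original slides: the helper is_valid_name and the candidate names are made up for illustration. It also rejects reserved keywords such as class, a rule not listed above, and note that str.isidentifier() is somewhat more permissive than the slide's rule because it accepts non-English letters as well.

```python
import keyword

# Hypothetical candidate names, some legal and some not.
candidates = ["pi", "net_sales", "_temp", "value2", "2value", "net-sales", "my var", "class"]


def is_valid_name(name):
    """True if `name` is a legal identifier and not a reserved keyword."""
    return name.isidentifier() and not keyword.iskeyword(name)


for name in candidates:
    print(f"{name!r:12} valid: {is_valid_name(name)}")

# Case sensitivity: these are three different variables.
var, Var, VAR = 1, 2, 3
print(var, Var, VAR)  # prints: 1 2 3
```

The names starting with a digit, containing a hyphen or a space, or matching a keyword are reported as invalid; the rest are valid.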
    # Step 1: Assign your data to variables
    example_1 = "Hello World"
    example_2 = 254
    example_3 = 25.43
    example_4 = ["Anna", "Bella", "Cora"]
    # Step 2: Check data types
    print(type(example_1))
    print(type(example_2))
    print(type(example_3))
    print(type(example_4))

Type conversion
To convert variables from one type to another (e.g. integers to floats), we use type conversions as follows:
    # Convert from integer to float
    int_to_float = float(15)       # convert the value
    print(int_to_float)            # print the converted value
    print(type(int_to_float))      # print the data type
    # Convert from float to integer (the decimal part is dropped, not rounded)
    float_to_int = int(23.56)
    print(float_to_int)
    print(type(float_to_int))
    # Convert from integer to string
    int_to_string = str(51)
    print(int_to_string)
    print(type(int_to_string))
    # Convert from string to integer
    string_to_integer = int("6589")
    print(string_to_integer)
    print(type(string_to_integer))

Q1. Determine and print the type of the following:
▪ variable1 = 123
▪ variable2 = "123"
▪ variable3 = 123.456
    # QUESTION 1
    variable1 = 123
    variable2 = "123"
    variable3 = 123.456
    # Print the data types
    print(type(variable1))
    print(type(variable2))
    print(type(variable3))
    print("\n")  # print an empty line

Q2. Convert the following variables and print the result:
▪ Convert this float into an integer: variable4 = 23.0
▪ Convert this string into an integer: variable5 = "6000"
    # QUESTION 2
    variable4 = int(23.0)      # convert the value
    print(variable4)           # print the converted value
    print(type(variable4))     # print the data type
    print("\n")                # print an empty line
    variable5 = int("6000")    # convert the value
    print(variable5)           # print the converted value
    print(type(variable5))     # print the data type

Preliminary Exploration in Python
Loading data, viewing it, summary statistics

What is Pandas?
To perform data analysis, the data must first be structured in a way that lets us manipulate it and perform operations on it. A common way to do this in Python is through the pandas module. Pandas is an open-source library that provides easy-to-use data structures and data analysis tools for the Python programming language. We will go through the basics of pandas and its applications for data analysis, along with other modules often used in conjunction with pandas.

Importing Modules
    import pandas as pd
    import matplotlib.pyplot as plt
    # In a Jupyter notebook, this ensures that the plot is displayed below the cell that contains the plotting code.
    %matplotlib inline

Basic syntax and functions of pandas
1. df.head() - Shows a preview of the first rows of your data so you can get an idea of the column headers and the type of data within the file
2. df.tail() - Shows a preview of the last rows of your data
3. df.shape - Returns the number of rows and columns in your dataset as a tuple (no. of rows, no. of columns)
4. df.columns - Returns the column header names (use df.dtypes to see the data type of each column)
5. df.info() - Prints detailed information about your dataset, such as column names, non-null counts, and data types
6. df.describe() - Returns summary statistics for the numeric columns of your dataset

Practice Activity:
1. Output the first few rows of data (Employee compensation.csv)
2. Output data dimensions (i.e. the number of rows and columns of the dataset)
3. Output the column header names
4. Obtain detailed information about the dataset
5. Output statistical information about the dataset

Manipulating the dataset
We can change certain features of our dataset through pandas. Here are some of the ways we can do that:
1. Changing datatypes
2. Sorting the data

Changing datatypes
If we have data that is not in our desired datatype, we can simply reassign the column after converting it, as shown below.
We make use of our basic pandas function, df.info(), to see the changes made.
    df.info()  # data types before the change
    df['net_sales'] = df['net_sales'].astype('float64')  # convert the column and reassign it
    df.info()  # data types after the change

Sorting the data
We can now change the order of the dataset according to rules we choose.
    # ascending=False because we want to sort in descending order
    df.sort_values(by='net_sales', ascending=False).head()
    # we can sort by multiple columns, each with its own direction
    df.sort_values(by=['order_fufilled', 'net_sales'], ascending=[False, True]).head()

1. Open Anaconda Navigator and launch a Jupyter notebook. It opens a new browser window.
2. Navigate to the directory where your csv file is saved and open a new Python notebook.
    import pandas as pd
    # Load data
    housing_df = pd.read_csv("WestRoxbury.csv")
    housing_df.shape    # find dimensions of the data frame
    housing_df.head()   # show the first five rows
    print(housing_df)   # show all the data
    # Rename columns: replace spaces with '_'
    housing_df = housing_df.rename(columns={'TOTAL VALUE ': 'TOTAL_VALUE'})  # explicit
    housing_df.columns = [s.strip().replace(' ', '_') for s in housing_df.columns]  # all columns
    # The same pattern applies to another data frame (bostonHousing_df is not loaded in this example):
    # bostonHousing_df = bostonHousing_df.rename(columns={'CAT.MEDV': 'CAT_MEDV'})

Calling specific attributes of the dataset (finding data): loc and iloc
loc: used for filtering rows and selecting columns by label (index labels or column names). With the default index, the row labels are the integers 0 to length-1.
Format: loc[row, column]
Example: row 0, all columns
    # print row 0, all columns
    print("row 0 to all columns")
    print(housing_df.loc[0, :])
    # print the first three rows, all columns
    print("first three rows to all columns")
    print(housing_df.loc[[0, 1, 2], :])

iloc: used for filtering rows and columns by integer position, which runs from 0 to length-1. It is primarily used when you know the positions of the rows and columns.
Format: iloc[row, column]
Example: selecting by position
    print("------df.columns names-------")
    print(housing_df.columns)
    print("------iloc-------")
    print(housing_df.iloc[:, 0:2])  # all rows, columns 0 and 1 (end position 2 is excluded)
    print(housing_df.iloc[0:2, :])  # rows 0 and 1, all columns

Columns
    # Show first four rows of the data
    housing_df.loc[0:3]    # loc[a:b] gives rows a to b, inclusive
    housing_df.iloc[0:4]   # iloc[a:b] gives rows a to b-1
    # Different ways of showing the first 10 values in column TOTAL_VALUE
    # use dot notation if the column name has no spaces
    housing_df.iloc[0:10].TOTAL_VALUE
    housing_df['TOTAL_VALUE'][0:10]
    # Show the fifth row of the first 10 columns
    housing_df.iloc[4, 0:10]
    # use a slice to return a data frame
    housing_df.iloc[4:5, 0:10]
    # To specify a full column, use:
    housing_df.iloc[:, 0:1]   # all rows of the first column
    housing_df.TOTAL_VALUE
    # Descriptive statistics
    # show the length of the column
    print('Number of rows ', len(housing_df['TOTAL_VALUE']))
    # show the mean of the column
    print('Mean of TOTAL_VALUE ', housing_df['TOTAL_VALUE'].mean())
    # show summary statistics for each column
    housing_df.describe()

Data structures in Python
▪ Vector: one column or row of data of a single type (numeric or text)
▪ Matrix: a two-dimensional (r × c) object (think of a bunch of stacked or side-by-side vectors). All elements in a matrix must be of the same data type (numeric or text)
▪ Array: a three-dimensional (r × c × h) object (think of a bunch of stacked r × c matrices)
▪ Data Frame: multiple columns and/or rows of data (multiple inputs), where different columns may hold different types
▪ Lists: can contain vectors, arrays, data frames, and other lists

Difference between List and Tuple in Python
▪ Mutability is the primary distinction between a list and a tuple in Python. Lists are mutable, so they suit dynamic data: you can modify their elements or size after creation. Tuples, on the other hand, are immutable: their elements and size cannot be modified after they have been defined.
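The mutability difference described above can be seen in a few lines. This is an illustrative sketch only; the variable names scores_list and scores_tuple are made up for the example.

```python
# A list and a tuple holding the same values.
scores_list = [10, 20, 30]
scores_tuple = (10, 20, 30)

# Lists are mutable: elements can be replaced and the size can change.
scores_list[0] = 99
scores_list.append(40)
print(scores_list)  # prints: [99, 20, 30, 40]

# Tuples are immutable: item assignment raises a TypeError.
try:
    scores_tuple[0] = 99
except TypeError as err:
    tuple_error = str(err)
    print("Tuple could not be modified:", tuple_error)

print(scores_tuple)  # prints: (10, 20, 30)
```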
Because they are immutable, tuples are suitable in cases where data stability and integrity are crucial, while lists are better suited to circumstances requiring frequent updates or modifications.

Column Selection and Ordering/Sorting
    # Quick examples
    # Default sort (ascending)
    df2 = df.sort_values('Courses')
    # Sort in descending order
    df2 = df.sort_values('Courses', ascending=False)
    # Sort by multiple columns
    df2 = df.sort_values(by=['Courses', 'Fee'])
    # Sort and reset the index to 0, 1, 2, ...
    df2 = df.sort_values(by='Courses', ignore_index=True)

Concatenating Data Frames
    import pandas as pd
    # First DataFrame
    df1 = pd.DataFrame({'id': ['A01', 'A02', 'A03', 'A04'],
                        'Name': ['ABC', 'PQR', 'DEF', 'GHI']})
    # Second DataFrame
    df2 = pd.DataFrame({'id': ['B05', 'B06', 'B07', 'B08'],
                        'Name': ['XYZ', 'TUV', 'MNO', 'JKL']})
    frames = [df1, df2]
    result = pd.concat(frames)   # stack df2 below df1
    display(result)              # display() works in Jupyter; use print(result) elsewhere

Column Binding (pasting columns) and Row Binding (pasting rows)

Example 1: Column-bind Two pandas Data Frames
    import pandas as pd
    # Create first pandas DataFrame
    data_cbind_1 = pd.DataFrame({"x1": range(10, 16),
                                 "x2": range(30, 24, -1),
                                 "x3": ["a", "b", "c", "d", "e", "f"],
                                 "x4": range(48, 42, -1)})
    # Print first pandas DataFrame
    print(data_cbind_1)
    # Create second pandas DataFrame
    data_cbind_2 = pd.DataFrame({"y1": ["foo", "bar", "bar", "foo", "foo", "bar"],
                                 "y2": ["x", "y", "z", "x", "y", "z"],
                                 "y3": range(18, 0, -3)})
    # Print second pandas DataFrame
    print(data_cbind_2)
Note that these two data sets have the same number of rows. This is important when applying a column-bind.
Source: https://statisticsglobe.com/rbind-cbind-pandas-dataframe-python

    # Cbind DataFrames: axis=1 places the columns side by side
    data_cbind_all = pd.concat([data_cbind_1.reset_index(drop=True), data_cbind_2], axis=1)
    print(data_cbind_all)

Example 2: Column-bind Two pandas Data Frames Using ID Column
This example illustrates how to use an ID column to match particular observations in two DataFrames to each other.
▪ This way, we do not have to take care of the ordering of the rows in the input DataFrames, and we can also merge DataFrames with a different number of rows.
    # Create first pandas DataFrame
    data_merge_1 = pd.DataFrame({"ID": range(1, 5),
                                 "x1": range(10, 14),
                                 "x2": range(30, 26, -1),
                                 "x3": ["a", "b", "c", "d"],
                                 "x4": range(48, 44, -1)})
    print(data_merge_1)

    # Create second pandas DataFrame
    data_merge_2 = pd.DataFrame({"ID": range(3, 9),
                                 "y1": ["foo", "bar", "bar", "foo", "foo", "bar"],
                                 "y2": ["x", "y", "z", "x", "y", "z"],
                                 "y3": range(18, 0, -3)})
    print(data_merge_2)

This time, both of our example DataFrames contain an ID column. However, the values and the number of rows in these data sets are different. If we want to bind the columns of those two data sets based on the ID column, we can use the merge function as shown below:
    # Merge DataFrames on ID; how="outer" keeps IDs that appear in either DataFrame
    data_merge_all = pd.merge(data_merge_1, data_merge_2, on="ID", how="outer")
    print(data_merge_all)
As you can see, the IDs of these two DataFrames have been matched. Where an ID does not exist in one of the data sets, NaN values have been inserted.

Example 3: Combine pandas Data Frames Vertically
▪ How to stack two DataFrames on top of each other, i.e. row-binding two DataFrames.
    # Create first pandas DataFrame
    data_rbind_1 = pd.DataFrame({"x1": range(11, 16),
                                 "x2": ["a", "b", "c", "d", "e"],
                                 "x3": range(30, 25, -1),
                                 "x4": range(30, 20, -2)})
    print(data_rbind_1)

    # Create second pandas DataFrame
    data_rbind_2 = pd.DataFrame({"x1": range(3, 10),
                                 "x2": ["x", "y", "y", "y", "x", "x", "y"],
                                 "x3": range(20, 6, -2),
                                 "x4": range(28, 21, -1)})
    print(data_rbind_2)
Note: Both of these DataFrames contain the same variables (i.e. x1, x2, x3, and x4).

    # Rbind DataFrames: ignore_index=True renumbers the rows 0, 1, 2, ...
    data_rbind_all = pd.concat([data_rbind_1, data_rbind_2], ignore_index=True, sort=False)
    print(data_rbind_all)
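As a quick way to compare the three approaches, the shape of each result shows what happened to the rows and columns. The sketch below uses two tiny made-up DataFrames (left and right are hypothetical names, not from the examples above) and assumes pandas is installed.

```python
import pandas as pd

# Two small hypothetical DataFrames that share an ID column.
left = pd.DataFrame({"ID": [1, 2, 3], "x": ["a", "b", "c"]})
right = pd.DataFrame({"ID": [2, 3, 4], "y": [20, 30, 40]})

# Column-bind: rows are paired by index label, so both ID columns are kept.
col_bound = pd.concat([left, right], axis=1)

# Merge on ID: rows are paired by ID value; unmatched IDs get NaN.
merged = pd.merge(left, right, on="ID", how="outer")

# Row-bind: the second DataFrame is stacked below the first.
row_bound = pd.concat([left, left], ignore_index=True)

print(col_bound.shape)  # prints: (3, 4)
print(merged.shape)     # prints: (4, 3)
print(row_bound.shape)  # prints: (6, 2)

# Count the NaN values introduced by the outer merge (IDs 1 and 4 are unmatched).
print(int(merged.isna().sum().sum()))  # prints: 2
```

Column-binding keeps the row count and adds columns, merging keeps one row per distinct ID, and row-binding keeps the column count and adds rows.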