Importing Data in Python I PDF
Document Details
Uploaded by MeritoriousConstructivism363
Tags
Summary
This document serves as a tutorial in Python on importing data from various resources. It discusses flat files, including .txt and .csv types, as well as importation from other applications and relational databases. Code examples are included to provide a hands-on approach to learning the topic.
Full Transcript
Data Loading IMPORTING DATA IN PYTHON I Importing Data in Python I Import data Flat files, e.g..txts,.csvs Files from other so!ware Relational databases Importing Data in Python I Plain text files Source: Projec...
Data Loading IMPORTING DATA IN PYTHON I Importing Data in Python I Import data Flat files, e.g..txts,.csvs Files from other so!ware Relational databases Importing Data in Python I Plain text files Source: Project Gutenberg Importing Data in Python I Table data row titanic.csv Name Sex Cabin Survived Braund, Mr. Owen Harris male NaN 0 Cumings, Mrs. John Bradley female C85 1 Heikkinen, Miss. Laina female NaN 1 Futrelle, Mrs. Jacques Heath female C123 1 Allen, Mr. William Henry male NaN 0 column Flat file Source: Kaggle Importing Data in Python I Reading a text file In : filename = 'huck_finn.txt' In : file = open(filename, mode='r') # 'r' is to read In : text = file.read() In : file.close() Importing Data in Python I Printing a text file In : print(text) YOU don't know about me without you have read a book by the name of The Adventures of Tom Sawyer; but that ain't no matter. That book was made by Mr. Mark Twain, and he told the truth, mainly. There was things which he stretched, but mainly he told the truth. That is nothing. never seen anybody but lied one time or another, without it was Aunt Polly, or the widow, or maybe Mary. Aunt Polly--Tom's Aunt Polly, she is--and Mary, and the Widow Douglas is all told about in that book, which is mostly a true book, with some stretchers, as I said before. Importing Data in Python I Writing to a file In : filename = 'huck_finn.txt' In : file = open(filename, mode='w') # 'w' is to write In : file.close() Importing Data in Python I Context manager with In : with open('huck_finn.txt', 'r') as file:...: print(file.read()) YOU don't know about me without you have read a book by the name of The Adventures of Tom Sawyer; but that ain't no matter. That book was made by Mr. Mark Twain, and he told the truth, mainly. There was things which he stretched, but mainly he told the truth. That is nothing. never seen anybody but lied one time or another, without it was Aunt Polly, or the widow, or maybe Mary. Aunt Polly--Tom's Aunt Polly, she is--and Mary, and the Widow Douglas is all told about in that book, which is mostly a true book, with some stretchers, as I said before. IMPORTING DATA IN PYTHON I The importance of flat files in data science Importing Data in Python I Flat files column titanic.csv PassengerId,Survived,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fa re,Cabin,Embarked 1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S 2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C 3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S row Name Gender Cabin Survived Braund, Mr. Owen Harris male NaN 0 Cumings, Mrs. John Bradley female C85 1 Heikkinen, Miss. Laina female NaN 1 Futrelle, Mrs. Jacques Heath female C123 1 Allen, Mr. William Henry male NaN 0 Importing Data in Python I Flat files Text files containing records That is, table data Record: row of fields or a!ributes Column: feature or a!ribute titanic.csv PassengerId,Survived,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fa re,Cabin,Embarked 1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S 2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C Importing Data in Python I Header titanic.csv PassengerId,Survived,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked 1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S 2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C 3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S 4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female, 35,1,0,113803,53.1,C123,S 5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S 6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q 7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S 8,0,3,"Palsson, Master. Gosta Leonard",male,2,3,1,349909,21.075,,S 9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female, 27,0,2,347742,11.1333,,S Importing Data in Python I File extension.csv - Comma separated values.txt - Text file commas, tabs - Delimiters Importing Data in Python I Tab-delimited file MNIST.txt pixel149 pixel150 pixel151 pixel152 pixel153 0 0 0 0 0 86 250 254 254 254 0 0 0 9 254 0 0 0 0 0 103 253 253 253 253 0 0 5 165 254 0 0 0 0 0 0 0 0 0 0 0 0 0 0 41 253 253 253 253 253 MNIST image Importing Data in Python I How do you import flat files? Two main packages: NumPy, pandas Here, you’ll learn to import: Flat files with numerical data (MNIST) Flat files with numerical data and strings (titanic.csv) IMPORTING DATA IN PYTHON I Importing flat files using NumPy Importing Data in Python I Why NumPy? NumPy arrays: standard for storing numerical data Essential for other packages: e.g. scikit-learn loadtxt() genfromtxt() Importing Data in Python I Importing flat files using NumPy In : import numpy as np In : filename = 'MNIST.txt' In : data = np.loadtxt(filename, delimiter=',') In : data Out: [[ 0. 0. 0. 0. 0.] [ 86. 250. 254. 254. 254.] [ 0. 0. 0. 9. 254.]..., [ 0. 0. 0. 0. 0.] [ 0. 0. 0. 0. 0.] [ 0. 0. 0. 0. 0.]] Importing Data in Python I Customizing your NumPy import In : import numpy as np In : filename = 'MNIST_header.txt' In : data = np.loadtxt(filename, delimiter=',', skiprows=1) In : print(data) [[ 0. 0. 0. 0. 0.] [ 86. 250. 254. 254. 254.] [ 0. 0. 0. 9. 254.]..., [ 0. 0. 0. 0. 0.] [ 0. 0. 0. 0. 0.] [ 0. 0. 0. 0. 0.]] Importing Data in Python I Customizing your NumPy import In : import numpy as np In : filename = 'MNIST_header.txt' In : data = np.loadtxt(filename, delimiter=',', skiprows=1, usecols=[0, 2]) In : print(data) [[ 0. 0.] [ 86. 254.] [ 0. 0.]..., [ 0. 0.] [ 0. 0.] [ 0. 0.]] Importing Data in Python I Customizing your NumPy import In : data = np.loadtxt(filename, delimiter=',', dtype=str) Importing Data in Python I Mixed datatypes titanic.csv Name Gender Cabin Fare Braund, Mr. Owen Harris male NaN 7.3 Cumings, Mrs. John Bradley female C85 71.3 Heikkinen, Miss. Laina female NaN 8.0 Futrelle, Mrs. Jacques Heath female C123 53.1 Allen, Mr. William Henry male NaN 8.05 strings floats Source: Kaggle IMPORTING DATA IN PYTHON I Importing flat files using pandas Importing Data in Python I What a data scientist needs Two-dimensional labeled data structure(s) Columns of potentially different types Manipulate, slice, reshape, groupby, join, merge Perform statistics Work with time series data Importing Data in Python I Pandas and the DataFrame DataFrame = pythonic analog of R’s data frame Importing Data in Python I Pandas and the DataFrame Importing Data in Python I Manipulating pandas DataFrames Exploratory data analysis Data wrangling Data preprocessing Building models Visualization Standard and best practice to use pandas Importing Data in Python I Importing using pandas In : import pandas as pd In : filename = 'winequality-red.csv' In : data = pd.read_csv(filename) In : data.head() Out: volatile acidity citric acid residual sugar 0 0.70 0.00 1.9 1 0.88 0.00 2.6 2 0.76 0.04 2.3 3 0.28 0.56 1.9 4 0.70 0.00 1.9 In : data_array = data.values Importing Data in Python I Next: Import other file types: Excel, SAS, Stata Feather Interact with relational databases IMPORTING DATA IN PYTHON I Introduction to other file types Importing Data in Python I Other file types Pickled files Excel spreadsheets MATLAB files SAS files Stata files Importing Data in Python I Pickled files File type native to Python Motivation: many datatypes for which it isn’t obvious how to store them Pickled files are serialized Serialize = convert object to bytestream Importing Data in Python I Pickled files In : import pickle In : with open('pickled_fruit.pkl', 'rb') as file:...: data = pickle.load(file) In : print(data) {'peaches': 13, 'apples': 4, 'oranges': 11} Importing Data in Python I Importing Excel spreadsheets In : import pandas as pd In : file = 'urbanpop.xlsx' In : data = pd.ExcelFile(file) In : print(data.sheet_names) ['1960-1966', '1967-1974', '1975-2011'] In : df1 = data.parse('1960-1966') sheet name, as a string In : df2 = data.parse(0) sheet index, as a float Importing Data in Python I other code import pandas as pd # File path to your Excel file file_path = 'urbanpop.xlsx # Read the Excel file, skipping the first two rows # Use the 'usecols' parameter to specify the columns by index or name df = pd.read_excel(file_path, skiprows=2, usecols=['A', 'C', 'E']) # Rename the columns df.rename(columns={'A': 'F1', 'C': 'F2', 'E': 'F3'}, inplace=True) # View the first few rows to confirm print(df.head()) IMPORTING DATA IN PYTHON I Importing SAS/Stata files using pandas Importing Data in Python I SAS and Stata files SAS: Statistical Analysis System Stata: “Statistics” + “data” SAS: business analytics and biostatistics Stata: academic social sciences research Importing Data in Python I SAS files Used for: Advanced analytics Multivariate analysis Business intelligence Data management Predictive analytics Standard for computational analysis Importing Data in Python I Importing SAS files In : import pandas as pd In : from sas7bdat import SAS7BDAT In : with SAS7BDAT('urbanpop.sas7bdat') as file:...: df_sas = file.to_data_frame() Importing Data in Python I Importing Stata files In : import pandas as pd In : data = pd.read_stata('urbanpop.dta') IMPORTING DATA IN PYTHON I Importing MATLAB files Importing Data in Python I MATLAB “Matrix Laboratory” Industry standard in engineering and science Data saved as.mat files Importing Data in Python I SciPy to the rescue! scipy.io.loadmat() - read.mat files scipy.io.savemat() - write.mat files Importing Data in Python I What is a.mat file? Importing Data in Python I Importing a.mat file In : import scipy.io In : filename = 'workspace.mat' In : mat = scipy.io.loadmat(filename) In : print(type(mat)) In : print(type(mat['x'])) keys = MATLAB variable names values = objects assigned to variables IMPORTING DATA IN PYTHON I Introduction to relational databases Importing Data in Python I What is a relational database? Based on relational model of data First described by Edgar “Ted” Codd Importing Data in Python I Example: Northwind database Orders table Employees table Customers table Importing Data in Python I The Orders table Importing Data in Python I Tables are linked Orders table Employees table Customers table Importing Data in Python I Relational model Widely adopted Todd’s 12 Rules/Commandments Consists of 13 rules (zero-indexed!) Describes what a Relational Database Management System should adhere to to be considered relational Importing Data in Python I Relational Database Management Systems PostgreSQL MySQL SQLite SQL = Structured Query Language IMPORTING DATA IN PYTHON I Creating a database engine in Python Importing Data in Python I Creating a database engine SQLite database Fast and simple SQLAlchemy Works with many Relational Database Management Systems In : from sqlalchemy import create_engine In : engine = create_engine('sqlite:///Northwind.sqlite') Importing Data in Python I Ge!ing table names In : from sqlalchemy import create_engine, inspect In : engine = create_engine('sqlite:///Northwind.sqlite') Create an inspector In : inspector = inspect(engine) # Get the table names from the database In : table_names = inspector.get_table_names() In : print(table_names) ['Categories', 'Customers', 'EmployeeTerritories', 'Employees', 'Order Details', 'Orders', 'Products', 'Region', 'Shippers', 'Suppliers', 'Territories'] IMPORTING DATA IN PYTHON I Querying relational databases in Python Importing Data in Python I Basic SQL query SELECT * FROM Table_Name Returns all columns of all rows of the table Example: SELECT * FROM Orders We’ll use SQLAlchemy and pandas Importing Data in Python I Workflow of SQL querying Import packages and functions Create the database engine Connect to the engine Query the database Save query results to a DataFrame Close the connection Importing Data in Python I Your first SQL query In : from sqlalchemy import create_engine, text In : import pandas as pd In : engine = create_engine('sqlite:///Northwind.sqlite') In : con = engine.connect() In : rs = con.execute(text("SELECT * FROM Orders")) In : df = pd.DataFrame(rs.fetchall()) In : con.close() Importing Data in Python I Printing your query results In : print(df.head()) 0 1 2 3 4 0 10248 VINET 5 7/4/1996 12:00:00 AM 8/1/1996 12:00:00 AM 1 10251 VICTE 3 7/8/1996 12:00:00 AM 8/5/1996 12:00:00 AM 2 10254 CHOPS 5 7/11/1996 12:00:00 AM 8/8/1996 12:00:00 AM 3 10256 WELLI 3 7/15/1996 12:00:00 AM 8/12/1996 12:00:00 AM 4 10258 ERNSH 1 7/17/1996 12:00:00 AM 8/14/1996 12:00:00 AM Importing Data in Python I Set the DataFrame column names In : from sqlalchemy import create_engine, text In : import pandas as pd In : engine = create_engine('sqlite:///Northwind.sqlite') In : con = engine.connect() In : rs = con.execute(text("SELECT * FROM Orders")) In : df = pd.DataFrame(rs.fetchall()) In : df.columns = rs.keys() In : con.close() Importing Data in Python I Set the data frame column names In : print(df.head()) OrderID CustomerID EmployeeID OrderDate 0 10248 VINET 5 7/4/1996 12:00:00 AM 1 10251 VICTE 3 7/8/1996 12:00:00 AM 2 10254 CHOPS 5 7/11/1996 12:00:00 AM 3 10256 WELLI 3 7/15/1996 12:00:00 AM 4 10258 ERNSH 1 7/17/1996 12:00:00 AM Importing Data in Python I Using the context manager In : from sqlalchemy import create_engine In : import pandas as pd In : engine = create_engine('sqlite:///Northwind.sqlite') In : with engine.connect() as con:...: rs = con.execute(text("SELECT OrderID, OrderDate, ShipName FROM Orders"))...: df = pd.DataFrame(rs.fetchmany(size=5))...: df.columns = rs.keys() IMPORTING DATA IN PYTHON I Querying relational databases directly with pandas Importing Data in Python I The pandas way to query In : from sqlalchemy import create_engine In : import pandas as pd In : engine = create_engine('sqlite:///Northwind.sqlite') In : with engine.connect() as con:...: rs = con.execute(text("SELECT * FROM Orders"))...: df = pd.DataFrame(rs.fetchall())...: df.columns = rs.keys() In : df = pd.read_sql_query(text("SELECT * FROM Orders"), engine) IMPORTING DATA IN PYTHON I Advanced querying: exploiting table relationships Importing Data in Python I Tables are linked Orders table Employees table Customers table Importing Data in Python I JOINing tables Orders table Customers table Importing Data in Python I INNER JOIN in Python (pandas) In : from sqlalchemy import create_engine In : import pandas as pd In : engine = create_engine('sqlite:///Northwind.sqlite') In : df = pd.read_sql_query(text("SELECT OrderID, CompanyName FROM Orders INNER JOIN Customers on Orders.CustomerID = Customers.CustomerID"), engine) In : print(df.head()) OrderID CompanyName 0 10248 Vins et alcools Chevalier 1 10251 Victuailles en stock 2 10254 Chop-suey Chinese 3 10256 Wellington Importadora 4 10258 Ernst Handel Importing Data in Python I What you’ve learned: Relational databases Queries SELECT WHERE JOIN Importing Data in Python I Next : Scrape data from the web Interact with APIs