Pandas Data Handling - PDF

SAJITH.K(BS)
Data Handling using Pandas - 1: Series

Python Library – Matplotlib

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. It can be used to:
1. Develop publication-quality plots with just a few lines of code.
2. Use interactive figures that can zoom, pan, update...

We can customize and take full control of line styles, font properties, axes properties..., as well as export and embed figures in a number of file formats and interactive environments.

Python Library – Pandas

Pandas is one of the most widely used Python packages for data science. It offers powerful and flexible data structures that make data import, analysis, and manipulation much easier. Pandas builds on packages like NumPy and matplotlib to give us a single, convenient place for data analysis and visualization work.

• Pandas is an open-source Python library providing high-performance data manipulation and analysis tools built on its data structures.
• Using Pandas, we can accomplish five typical steps in the processing and analysis of data, regardless of the origin of the data: load, prepare, manipulate, model, and analyze.
• Python with Pandas is used in a wide range of academic and commercial fields, including finance, economics, statistics, and analytics.
• Pandas is a package for data analysis and manipulation that provides an easy way to create, manipulate, and wrangle data.
• Pandas provides powerful and easy-to-use data structures, as well as the means to quickly perform operations on these structures.

Basic Features of Pandas

1. The DataFrame object helps a lot in keeping track of our data.
2. A pandas DataFrame can hold different data types (float, int, string, datetime, etc.) all in one place.
3. Pandas has built-in functionality for easy grouping and joining of data, and for rolling windows.
4. Good I/O capabilities: data can easily be pulled from a MySQL database directly into a DataFrame.
5. With pandas, you can use patsy for R-style syntax when doing regressions.
6. Tools for loading data into in-memory data objects from different file formats.
7. Data alignment and integrated handling of missing data.
8. Reshaping and pivoting of data sets.
9. Label-based slicing, indexing, and subsetting of large data sets.

Data scientists use Pandas for the following advantages:
• It easily handles missing data.
• It uses Series for one-dimensional data and DataFrame for multi-dimensional data.
• It provides an efficient way to slice the data.
• It provides a flexible way to merge, concatenate, or reshape the data.

DATA STRUCTURE IN PANDAS

A data structure is a way of arranging data so that it can be accessed quickly and so that operations such as retrieval, deletion, and modification can be performed on it. Pandas has traditionally offered three data structures:
1. Series
2. DataFrame
3. Panel (removed from recent versions of pandas)

Only Series and DataFrame are in our syllabus.

Series

A Series is a one-dimensional array-like structure with homogeneous data. For example, the following Series is a collection of integers 10, 23, 56, …

10 23 56 17 52 61 73 90 26 72

Key Points
• Homogeneous data
• Size immutable
• Values of data mutable
• A Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). All elements share one dtype; mixed values are stored with the object dtype.
• The axis labels are collectively called the index.
• A pandas Series is comparable to a single column in an Excel sheet.
• A Series cannot contain multiple columns.

It has two parts:
1. Data part (an array of actual data)
2. Associated index (an array of indexes or data labels)

• We can say that a Series is a labeled one-dimensional array which can hold any type of data.
• The data of a Series is always mutable, meaning its values can be changed.
• The size of a Series, however, is treated as immutable: operations that add or remove elements generally return a new Series.
• A Series may be considered a data structure with two arrays, one working as the index (labels) and the other holding the actual data.
• Row labels in a Series are called the index.

Syntax to create a Series:

A pandas Series can be created using the following constructor:

pandas.Series(data, index, dtype, copy)

The parameters of the constructor are as follows:

Sr.No  Parameter  Description
1      data       Data takes various forms such as an ndarray, a list, or a constant.
2      index      Index values must be hashable and the same length as data. Duplicate labels are allowed. Defaults to np.arange(n) if no index is passed.
3      dtype      The data type. If None, the data type is inferred.
4      copy       Copy the data. Default False.

CREATION OF SERIES

1. CREATE AN EMPTY SERIES
The most basic Series that can be created is an empty Series, which does not hold any values.

2. CREATE A SERIES FROM NDARRAY
Without index
Note: the default index starts from 0.
With index
Note: the index starts from 100.

3. CREATE A SERIES FROM DICTIONARY
If a dictionary is passed as input and no index is specified, the dictionary keys are used to construct the index. Current pandas versions keep the keys in insertion order; older versions sorted them. If an index is passed, the values corresponding to each label in the index are extracted from the dictionary.
Without index
Observe: dictionary keys are used to construct the index.
With index
Observe: the index order is preserved, and a label missing from the dictionary is filled with NaN (Not a Number).

4. CREATE A SERIES FROM SCALAR
If data is a scalar value, an index must be provided. The value is repeated to match the length of the index.

5. CREATE A SERIES FROM LIST
To create a Series from a list, first create the list and then pass it to the Series constructor.
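The five ways of creating a Series listed above can be sketched roughly as follows. This is a minimal illustration, not the original handout's examples: the variable names and sample values (such as `marks` and the labels 100 to 103) are assumptions chosen for demonstration.

```python
import numpy as np
import pandas as pd

# 1. Empty Series: the dtype is given explicitly so pandas does not have to guess it
empty = pd.Series(dtype="float64")

# 2. From an ndarray: default index (0, 1, 2, ...) and then custom labels
arr = np.array([10, 23, 56, 17])
from_array = pd.Series(arr)
from_array_labeled = pd.Series(arr, index=[100, 101, 102, 103])

# 3. From a dictionary: keys become the index
marks = {"a": 10, "b": 20, "c": 30}   # hypothetical sample data
from_dict = pd.Series(marks)
# With an explicit index, the index order is kept and the missing label "d" becomes NaN
from_dict_indexed = pd.Series(marks, index=["b", "c", "d", "a"])

# 4. From a scalar: the value is repeated for every index label
from_scalar = pd.Series(5, index=[0, 1, 2, 3])

# 5. From a list: create the list first, then pass it to the constructor
values = [10, 23, 56]
from_list = pd.Series(values)

print(from_dict_indexed)
```

Because NaN is a floating-point value, `from_dict_indexed` ends up with a float dtype even though the dictionary holds integers.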
head and tail functions

head()
This method returns a specified number of rows from the beginning of a Series. It returns a new Series. By default, head() returns the first 5 rows.
Note: to access the first 3 rows, call series_name.head(3).

tail()
This method returns a specified number of rows from the end of a Series. It returns a new Series. By default, tail() returns the last 5 rows.
Note: to access the last 4 rows, call series_name.tail(4).

Mathematical Operations in Series

Attributes of Series

index: Returns the index (labels) of the Series.
values: Returns the values of the Series as a NumPy array.
name: Returns the name of the Series.
empty: Returns True if the Series is empty, and False otherwise.
dtype: Returns the data type of the elements in the Series.
shape: Returns a tuple representing the shape of the Series. For a one-dimensional Series, the shape is (n,).
index.name: Gets or assigns the name of the index of the Series.
size or len(series): Returns the number of elements in the Series.

```python
import pandas as pd

# Creating a Series
data = {'a': 10, 'b': 20, 'c': 30}
series = pd.Series(data)

print("Index of the Series:", series.index)      # Accessing the index attribute
print("Values of the Series:", series.values)    # Accessing the values attribute
print("Name of the Series:", series.name)        # Accessing the name attribute (None until set)

series.name = 'IIS'                              # Naming the Series
print("Name of the Series:", series.name)

print("Data type of the Series:", series.dtype)  # Accessing the dtype attribute
print("Size of the Series:", series.size)        # Accessing the size attribute
print("Shape of the Series:", series.shape)      # Accessing the shape attribute

series.index.name = "ideal"                      # Naming the index of the Series
print("Name of index of the Series:", series.index.name)  # Accessing the index name attribute
print(series)
```
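The head and tail functions and the mathematical operations on Series mentioned above could look like the following sketch. The sample Series `s`, `s1`, and `s2` are assumptions made up for illustration; only standard pandas calls (`head`, `tail`, `+`, `*`, and `Series.add` with `fill_value`) are used.

```python
import pandas as pd

s = pd.Series([10, 23, 56, 17, 52, 61, 73, 90, 26, 72])

first_five = s.head()     # first 5 rows by default
first_three = s.head(3)   # first 3 rows
last_five = s.tail()      # last 5 rows by default
last_four = s.tail(4)     # last 4 rows

# Arithmetic between two Series is aligned on index labels, not on position
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

total = s1 + s2                          # labels present in only one Series give NaN
total_filled = s1.add(s2, fill_value=0)  # treat a missing label as 0 instead
doubled = s1 * 2                         # a scalar is applied to every element

print(total)
print(total_filled)
```

Here `total` has NaN for the labels "a" and "d" because each appears in only one of the two Series, while `total_filled` treats the missing side as 0.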
