Python Data Analytics With Pandas, NumPy, and Matplotlib PDF

Summary

This book provides a comprehensive guide to Python data analytics, using libraries like Pandas, NumPy, and Matplotlib. It details various data analysis concepts, tools, and techniques, ideal for those learning data science and analysis with Python.

Full Transcript


Python Data Analytics
With Pandas, NumPy, and Matplotlib
Second Edition
Fabio Nelli
Rome, Italy

ISBN-13 (pbk): 978-1-4842-3912-4
ISBN-13 (electronic): 978-1-4842-3913-1
https://doi.org/10.1007/978-1-4842-3913-1
Library of Congress Control Number: 2018957991

Copyright © 2018 by Fabio Nelli

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Managing Director, Apress Media LLC: Welmoed Spahr
Acquisitions Editor: Todd Green
Development Editor: James Markham
Coordinating Editor: Jill Balzano
Cover image designed by Freepik (www.freepik.com)

Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail orders-ny@springer-sbm.com, or visit www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.

For information on translations, please e-mail [email protected], or visit http://www.apress.com/rights-permissions. Apress titles may be purchased in bulk for academic, corporate, or promotional use. eBook versions and licenses are also available for most titles. For more information, reference our Print and eBook Bulk Sales web page at http://www.apress.com/bulk-sales.

Any source code or other supplementary material referenced by the author in this book is available to readers on GitHub via the book's product page, located at www.apress.com/9781484239124. For more detailed information, please visit http://www.apress.com/source-code.
Printed on acid-free paper

"Science leads us forward in knowledge, but only analysis makes us more aware"

This book is dedicated to all those who are constantly looking for awareness

Table of Contents

About the Author
About the Technical Reviewer

Chapter 1: An Introduction to Data Analysis
  Data Analysis
  Knowledge Domains of the Data Analyst
  Computer Science
  Mathematics and Statistics
  Machine Learning and Artificial Intelligence
  Professional Fields of Application
  Understanding the Nature of the Data
  When the Data Become Information
  When the Information Becomes Knowledge
  Types of Data
  The Data Analysis Process
  Problem Definition
  Data Extraction
  Data Preparation
  Data Exploration/Visualization
  Predictive Modeling
  Model Validation
  Deployment
  Quantitative and Qualitative Data Analysis
  Open Data
  Python and Data Analysis
  Conclusions

Chapter 2: Introduction to the Python World
  Python—The Programming Language
  Python—The Interpreter
  Python 2 and Python 3
  Installing Python
  Python Distributions
  Using Python
  Writing Python Code
  IPython
  PyPI—The Python Package Index
  The IDEs for Python
  SciPy
  NumPy
  Pandas
  matplotlib
  Conclusions

Chapter 3: The NumPy Library
  NumPy: A Little History
  The NumPy Installation
  Ndarray: The Heart of the Library
  Create an Array
  Types of Data
  The dtype Option
  Intrinsic Creation of an Array
  Basic Operations
  Arithmetic Operators
  The Matrix Product
  Increment and Decrement Operators
  Universal Functions (ufunc)
  Aggregate Functions
  Indexing, Slicing, and Iterating
  Indexing
  Slicing
  Iterating an Array
  Conditions and Boolean Arrays
  Shape Manipulation
  Array Manipulation
  Joining Arrays
  Splitting Arrays
  General Concepts
  Copies or Views of Objects
  Vectorization
  Broadcasting
  Structured Arrays
  Reading and Writing Array Data on Files
  Loading and Saving Data in Binary Files
  Reading Files with Tabular Data
  Conclusions

Chapter 4: The pandas Library—An Introduction
  pandas: The Python Data Analysis Library
  Installation of pandas
  Installation from Anaconda
  Installation from PyPI
  Installation on Linux
  Installation from Source
  A Module Repository for Windows
  Testing Your pandas Installation
  Getting Started with pandas
  Introduction to pandas Data Structures
  The Series
  The DataFrame
  The Index Objects
  Other Functionalities on Indexes
  Reindexing
  Dropping
  Arithmetic and Data Alignment
  Operations Between Data Structures
  Flexible Arithmetic Methods
  Operations Between DataFrame and Series
  Function Application and Mapping
  Functions by Element
  Functions by Row or Column
  Statistics Functions
  Sorting and Ranking
  Correlation and Covariance
  "Not a Number" Data
  Assigning a NaN Value
  Filtering Out NaN Values
  Filling in NaN Occurrences
  Hierarchical Indexing and Leveling
  Reordering and Sorting Levels
  Summary Statistic by Level
  Conclusions

Chapter 5: pandas: Reading and Writing Data
  I/O API Tools
  CSV and Textual Files
  Reading Data in CSV or Text Files
  Using RegExp to Parse TXT Files
  Reading TXT Files Into Parts
  Writing Data in CSV
  Reading and Writing HTML Files
  Writing Data in HTML
  Reading Data from an HTML File
  Reading Data from XML
  Reading and Writing Data on Microsoft Excel Files
  JSON Data
  The Format HDF5
  Pickle—Python Object Serialization
  Serialize a Python Object with cPickle
  Pickling with pandas
  Interacting with Databases
  Loading and Writing Data with SQLite3
  Loading and Writing Data with PostgreSQL
  Reading and Writing Data with a NoSQL Database: MongoDB
  Conclusions

Chapter 6: pandas in Depth: Data Manipulation
  Data Preparation
  Merging
  Concatenating
  Combining
  Pivoting
  Removing
  Data Transformation
  Removing Duplicates
  Mapping
  Discretization and Binning
  Detecting and Filtering Outliers
  Permutation
  Random Sampling
  String Manipulation
  Built-in Methods for String Manipulation
  Regular Expressions
  Data Aggregation
  GroupBy
  A Practical Example
  Hierarchical Grouping
  Group Iteration
  Chain of Transformations
  Functions on Groups
  Advanced Data Aggregation
  Conclusions

Chapter 7: Data Visualization with matplotlib
  The matplotlib Library
  Installation
  The IPython and IPython QtConsole
  The matplotlib Architecture
  Backend Layer
  Artist Layer
  Scripting Layer (pyplot)
  pylab and pyplot
  pyplot
  A Simple Interactive Chart
  The Plotting Window
  Set the Properties of the Plot
  matplotlib and NumPy
  Using the kwargs
  Working with Multiple Figures and Axes
  Adding Elements to the Chart
  Adding Text
  Adding a Grid
  Adding a Legend
  Saving Your Charts
  Saving the Code
  Converting Your Session to an HTML File
  Saving Your Chart Directly as an Image
  Handling Date Values
  Chart Typology
  Line Charts
  Line Charts with pandas
  Histograms
  Bar Charts
  Horizontal Bar Charts
  Multiserial Bar Charts
  Multiseries Bar Charts with pandas Dataframe
  Multiseries Stacked Bar Charts
  Stacked Bar Charts with a pandas Dataframe
  Other Bar Chart Representations
  Pie Charts
  Pie Charts with a pandas Dataframe
  Advanced Charts
  Contour Plots
  Polar Charts
  The mplot3d Toolkit
  3D Surfaces
  Scatter Plots in 3D
  Bar Charts in 3D
  Multi-Panel Plots
  Display Subplots Within Other Subplots
  Grids of Subplots
  Conclusions

Chapter 8: Machine Learning with scikit-learn
  The scikit-learn Library
  Machine Learning
  Supervised and Unsupervised Learning
  Training Set and Testing Set
  Supervised Learning with scikit-learn
  The Iris Flower Dataset
  The PCA Decomposition
  K-Nearest Neighbors Classifier
  Diabetes Dataset
  Linear Regression: The Least Square Regression
  Support Vector Machines (SVMs)
  Support Vector Classification (SVC)
  Nonlinear SVC
  Plotting Different SVM Classifiers Using the Iris Dataset
  Support Vector Regression (SVR)
  Conclusions

Chapter 9: Deep Learning with TensorFlow
  Artificial Intelligence, Machine Learning, and Deep Learning
  Artificial Intelligence
  Machine Learning Is a Branch of Artificial Intelligence
  Deep Learning Is a Branch of Machine Learning
  The Relationship Between Artificial Intelligence, Machine Learning, and Deep Learning
  Deep Learning
  Neural Networks and GPUs
  Data Availability: Open Data Source, Internet of Things, and Big Data
  Python
  Deep Learning Python Frameworks
  Artificial Neural Networks
  How Artificial Neural Networks Are Structured
  Single Layer Perceptron (SLP)
  Multi Layer Perceptron (MLP)
  Correspondence Between Artificial and Biological Neural Networks
  TensorFlow
  TensorFlow: Google's Framework
  TensorFlow: Data Flow Graph
  Start Programming with TensorFlow
  Installing TensorFlow
  Programming with the IPython QtConsole
  The Model and Sessions in TensorFlow
  Tensors
  Operation on Tensors
  Single Layer Perceptron with TensorFlow
  Before Starting
  Data To Be Analyzed
  The SLP Model Definition
  Learning Phase
  Test Phase and Accuracy Calculation
  Multi Layer Perceptron (with One Hidden Layer) with TensorFlow
  The MLP Model Definition
  Learning Phase
  Test Phase and Accuracy Calculation
  Multi Layer Perceptron (with Two Hidden Layers) with TensorFlow
  Test Phase and Accuracy Calculation
  Evaluation of Experimental Data
  Conclusions

Chapter 10: An Example—Meteorological Data
  A Hypothesis to Be Tested: The Influence of the Proximity of the Sea
  The System in the Study: The Adriatic Sea and the Po Valley
  Finding the Data Source
  Data Analysis on Jupyter Notebook
  Analysis of Processed Meteorological Data
  The RoseWind
  Calculating the Mean Distribution of the Wind Speed
  Conclusions

Chapter 11: Embedding the JavaScript D3 Library in the IPython Notebook
  The Open Data Source for Demographics
  The JavaScript D3 Library
  Drawing a Clustered Bar Chart
  The Choropleth Maps
  The Choropleth Map of the U.S. Population in 2014
  Conclusions

Chapter 12: Recognizing Handwritten Digits
  Handwriting Recognition
  Recognizing Handwritten Digits with scikit-learn
  The Digits Dataset
  Learning and Predicting
  Recognizing Handwritten Digits with TensorFlow
  Learning and Predicting
  Conclusions

Chapter 13: Textual Data Analysis with NLTK
  Text Analysis Techniques
  The Natural Language Toolkit (NLTK)
  Import the NLTK Library and the NLTK Downloader Tool
  Search for a Word with NLTK
  Analyze the Frequency of Words
  Selection of Words from Text
  Bigrams and Collocations
  Use Text on the Network
  Extract the Text from the HTML Pages
  Sentimental Analysis
  Conclusions

Chapter 14: Image Analysis and Computer Vision with OpenCV
  Image Analysis and Computer Vision
  OpenCV and Python
  OpenCV and Deep Learning
  Installing OpenCV
  First Approaches to Image Processing and Analysis
  Before Starting
  Load and Display an Image
  Working with Images
  Save the New Image
  Elementary Operations on Images
  Image Blending
  Image Analysis
  Edge Detection and Image Gradient Analysis
  Edge Detection
  The Image Gradient Theory
  A Practical Example of Edge Detection with the Image Gradient Analysis
  A Deep Learning Example: The Face Detection
  Conclusions

Appendix A: Writing Mathematical Expressions with LaTeX
  With matplotlib
  With IPython Notebook in a Markdown Cell
  With IPython Notebook in a Python 2 Cell
  Subscripts and Superscripts
  Fractions, Binomials, and Stacked Numbers
  Radicals
  Fonts
  Accents

Appendix B: Open Data Sources
  Political and Government Data
  Health Data
  Social Data
  Miscellaneous and Public Data Sets
  Financial Data
  Climatic Data
  Sports Data
  Publications, Newspapers, and Books
  Musical Data

Index

About the Author
Fabio Nelli is a data scientist and Python consultant, designing and developing Python applications for data analysis and visualization. He has experience with the scientific world, having performed various data analysis roles in pharmaceutical chemistry for private research companies and universities. He has been a computer consultant for many years at IBM, EDS, and Hewlett-Packard, along with several banks and insurance companies. He has an organic chemistry master's degree and a bachelor's degree in information technologies and automation systems, with many years of experience in life sciences (as a Tech Specialist at Beckman Coulter, Tecan, and Sciex). For further info and other examples, visit his page at https://www.meccanismocomplesso.org and the GitHub page https://github.com/meccanismocomplesso.

About the Technical Reviewer

Raul Samayoa is a senior software developer and machine learning specialist with many years of experience in the financial industry. An MSc graduate from the Georgia Institute of Technology, he's never met a neural network or dataset he did not like. He's fond of evangelizing the use of DevOps tools for data science and software development. Raul enjoys the energy of his hometown of Toronto, Canada, where he runs marathons, volunteers as a technology instructor with the University of Toronto coders, and likes to work with data in Python and R.

CHAPTER 1
An Introduction to Data Analysis

In this chapter, you begin to take the first steps in the world of data analysis, learning in detail about all the concepts and processes that make up this discipline. The concepts discussed in this chapter are helpful background for the following chapters, where these concepts and procedures will be applied in the form of Python code, through the use of several libraries that will be discussed in just as many chapters.

Data Analysis

In a world increasingly centralized around information technology, huge amounts of data are produced and stored each day. Often these data come from automatic detection systems, sensors, and scientific instrumentation, or you produce them daily and unconsciously every time you make a withdrawal from the bank or make a purchase, when you write on various blogs, or even when you post on social networks. But what are the data? The data actually are not information, at least in terms of their form. In the formless stream of bytes, at first glance it is difficult to understand their essence if not strictly the number, word, or time that they report. Information is actually the result of processing, which, taking into account a certain dataset, extracts some conclusions that can be used in various ways. This process of extracting information from raw data is called data analysis.

The purpose of data analysis is to extract information that is not easily deducible but that, when understood, leads to the possibility of carrying out studies on the mechanisms of the systems that have produced them, thus allowing you to forecast possible responses of these systems and their evolution in time.

Starting from a simple methodical approach on data protection, data analysis has become a real discipline, leading to the development of real methodologies generating models. The model is in fact the translation into a mathematical form of a system placed under study.
Once there is a mathematical or logical form that can describe system responses under different levels of precision, you can then make predictions about its development or response to certain inputs. Thus the aim of data analysis is not the model, but the quality of its predictive power.

The predictive power of a model depends not only on the quality of the modeling techniques but also on the ability to choose a good dataset upon which to build the entire data analysis process. So the search for data, their extraction, and their subsequent preparation, while representing preliminary activities of an analysis, also belong to data analysis itself, because of their importance in the success of the results.

So far we have spoken of data, their handling, and their processing through calculation procedures. In parallel to all stages of processing of data analysis, various methods of data visualization have been developed. In fact, to understand the data, both individually and in terms of the role they play in the entire dataset, there is no better system than to develop the techniques of graphic representation capable of transforming information, sometimes implicitly hidden, into figures, which help you more easily understand their meaning. Over the years lots of display modes have been developed for different modes of data display: the charts.

At the end of the data analysis process, you will have a model and a set of graphical displays and then you will be able to predict the responses of the system under study; after that, you will move to the test phase. The model will be tested using another set of data for which you know the system response. These data are, however, not used to define the predictive model. Depending on the ability of the model to replicate real observed responses, you will have an error calculation and knowledge of the validity of the model and its operating limits. These results can be compared with any other models to understand if the newly created one is more efficient than the existing ones. Once you have assessed that, you can move to the last phase of data analysis—deployment. This consists of implementing the results produced by the analysis, namely, implementing the decisions to be taken based on the predictions generated by the model and the associated risks.

Data analysis is well suited to many professional activities. So, knowledge of it and how it can be put into practice is relevant. It allows you to test hypotheses and to understand more deeply the systems analyzed.

Knowledge Domains of the Data Analyst

Data analysis is basically a discipline suitable to the study of problems that may occur in several fields of applications. Moreover, data analysis includes many tools and methodologies that require good knowledge of computing, mathematical, and statistical concepts. A good data analyst must be able to move and act in many different disciplinary areas. Many of these disciplines are the basis of the methods of data analysis, and proficiency in them is almost necessary. Knowledge of other disciplines is necessary depending on the area of application and study of the particular data analysis project you are about to undertake, and, more generally, sufficient experience in these areas can help you better understand the issues and the type of data needed.
Often, regarding major problems of data analysis, it is necessary to have an interdisciplinary team of experts who can contribute in the best possible way in their respective fields of competence. Regarding smaller problems, a good analyst must be able to recognize problems that arise during data analysis, inquire to determine which disciplines and skills are necessary to solve these problems, study these disciplines, and maybe even ask the most knowledgeable people in the sector. In short, the analyst must be able to know how to search not only for data, but also for information on how to treat that data.

Computer Science

Knowledge of computer science is a basic requirement for any data analyst. In fact, only when you have good knowledge of and experience in computer science can you efficiently manage the necessary tools for data analysis. In fact, every step concerning data analysis involves using calculation software (such as IDL, MATLAB, etc.) and programming languages (such as C++, Java, and Python).

The large amount of data available today, thanks to information technology, requires specific skills in order to be managed as efficiently as possible. Indeed, data research and extraction require knowledge of these various formats. The data are structured and stored in files or database tables with particular formats. XML, JSON, or simply XLS or CSV files are now the common formats for storing and collecting data, and many applications allow you to read and manage the data stored in them. When it comes to extracting data contained in a database, things are not so immediate, but you need to know the SQL query language or use software specially developed for the extraction of data from a given database.

Moreover, for some specific types of data research, the data are not available in an explicit format, but are present in text files (documents and log files) or web pages, and shown as charts, measures, number of visitors, or HTML tables. This requires specific technical expertise for the parsing and the eventual extraction of these data (called web scraping).

So, knowledge of information technology is necessary to know how to use the various tools made available by contemporary computer science, such as applications and programming languages. These tools, in turn, are needed to perform data analysis and data visualization.

The purpose of this book is to provide all the necessary knowledge, as far as possible, regarding the development of methodologies for data analysis. The book uses the Python programming language and specialized libraries that provide a decisive contribution to the performance of all the steps constituting data analysis, from data research to data mining, to publishing the results of the predictive model.
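As a first taste of what this looks like in practice, here is a minimal sketch of how the pandas library, covered from Chapter 4 onward, reads some of these common formats; the file names below are only placeholders for illustration:

import pandas as pd

# Each reader returns a DataFrame, pandas' tabular data structure.
# The file names here are hypothetical examples, not files from this book.
csv_data = pd.read_csv('measurements.csv')
json_data = pd.read_json('records.json')
xls_data = pd.read_excel('report.xls')   # needs an Excel reader module installed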
Mathematics and Statistics

As you will see throughout the book, data analysis requires a lot of complex math during the treatment and processing of data. You need to be competent in all of this, at least to understand what you are doing. Some familiarity with the main statistical concepts is also necessary because all the methods that are applied in the analysis and interpretation of data are based on these concepts. Just as you can say that computer science gives you the tools for data analysis, so you can say that statistics provides the concepts that form the basis of data analysis. This discipline provides many tools to the analyst, and a good knowledge of how to best use them requires years of experience. Among the most commonly used statistical techniques in data analysis are:

- Bayesian methods
- Regression
- Clustering

Having to deal with these cases, you'll discover how mathematics and statistics are closely related. Thanks to the special Python libraries covered in this book, you will be able to manage and handle them.

Machine Learning and Artificial Intelligence

One of the most advanced tools that falls in the data analysis camp is machine learning. In fact, despite the data visualization and techniques such as clustering and regression, which should help you find information about the dataset, during this phase of research, you may often prefer to use special procedures that are highly specialized in searching patterns within the dataset.

Machine learning is a discipline that uses a whole series of procedures and algorithms that analyze the data in order to recognize patterns, clusters, or trends and then extracts useful information for data analysis in an automated way. This discipline is increasingly becoming a fundamental tool of data analysis, and thus knowledge of it, at least in general, is of fundamental importance to the data analyst.

Professional Fields of Application

Another very important point is the domain of competence of the data (its source—biology, physics, finance, materials testing, statistics on population, etc.). In fact, although analysts have had specialized preparation in the field of statistics, they must also be able to document the source of the data, with the aim of perceiving and better understanding the mechanisms that generated the data. In fact, the data are not simple strings or numbers; they are the expression, or rather the measure, of any parameter observed. Thus, better understanding where the data came from can improve their interpretation. Often, however, this is too costly for data analysts, even ones with the best intentions, and so it is good practice to find consultants or key figures to whom you can pose the right questions.

Understanding the Nature of the Data

The object of study of data analysis is basically the data. The data then will be the key player in all processes of data analysis. The data constitute the raw material to be processed, and thanks to their processing and analysis, it is possible to extract a variety of information in order to increase the level of knowledge of the system under study, that is, the one from which the data came.

When the Data Become Information

Data are the events recorded in the world. Anything that can be measured or categorized can be converted into data. Once collected, these data can be studied and analyzed, both to understand the nature of the events and very often also to make predictions or at least to make informed decisions.

When the Information Becomes Knowledge

You can speak of knowledge when the information is converted into a set of rules that helps you better understand certain mechanisms and therefore make predictions on the evolution of some events.

Types of Data

Data can be divided into two distinct categories:

- Categorical (nominal and ordinal)
- Numerical (discrete and continuous)

Categorical data are values or observations that can be divided into groups or categories. There are two types of categorical values: nominal and ordinal. A nominal variable has no intrinsic order that is identified in its category. An ordinal variable instead has a predetermined order. Numerical data are values or observations that come from measurements. There are two types of numerical values: discrete and continuous numbers. Discrete values can be counted and are distinct and separated from each other. Continuous values, on the other hand, are values produced by measurements or observations that assume any value within a defined range.
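Anticipating the pandas library introduced in Chapter 4, here is a minimal sketch (with invented values) of how these four types of data map onto its data structures:

import pandas as pd

# Nominal: no intrinsic order among the categories
blood_type = pd.Series(['A', 'B', '0', 'AB'], dtype='category')

# Ordinal: a predetermined order among the categories
size = pd.Series(pd.Categorical(['small', 'large', 'medium'],
                 categories=['small', 'medium', 'large'], ordered=True))

# Discrete (counted) versus continuous (measured) numerical values
visitors = pd.Series([120, 98, 153])          # discrete, stored as int64
temperature = pd.Series([21.4, 22.0, 19.8])   # continuous, stored as float64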
The Data Analysis Process

Data analysis can be described as a process consisting of several steps in which the raw data are transformed and processed in order to produce data visualizations and make predictions thanks to a mathematical model based on the collected data. Then, data analysis is nothing more than a sequence of steps, each of which plays a key role in the subsequent ones. So, data analysis is schematized as a process chain consisting of the following sequence of stages:

- Problem definition
- Data extraction
- Data preparation - Data cleaning
- Data preparation - Data transformation
- Data exploration and visualization
- Predictive modeling
- Model validation/test
- Deploy - Visualization and interpretation of results
- Deploy - Deployment of the solution

Figure 1-1 shows a schematic representation of all the processes involved in the data analysis.

[Figure 1-1. The data analysis process]

Problem Definition

The process of data analysis actually begins long before the collection of raw data. In fact, data analysis always starts with a problem to be solved, which needs to be defined. The problem is defined only after you have focused on the system you want to study; this may be a mechanism, an application, or a process in general. Generally this study is carried out in order to better understand its operation, but in particular the study will be designed to understand the principles of its behavior in order to be able to make predictions or choices (defined as an informed choice).

The definition step and the corresponding documentation (deliverables) of the scientific problem or business are both very important in order to focus the entire analysis strictly on getting results. In fact, a comprehensive or exhaustive study of the system is sometimes complex and you do not always have enough information to start with. So the definition of the problem and especially its planning can determine the guidelines to follow for the whole project.

Once the problem has been defined and documented, you can move to the project planning stage of data analysis. Planning is needed to understand which professionals and resources are necessary to meet the requirements to carry out the project as efficiently as possible. So you're going to consider the issues in the area involving the resolution of the problem. You will look for specialists in various areas of interest and install the software needed to perform data analysis. Also during the planning phase, you choose an effective team. Generally, these teams should be cross-disciplinary in order to solve the problem by looking at the data from different perspectives. So, building a good team is certainly one of the key factors leading to success in data analysis.

Data Extraction

Once the problem has been defined, the first step is to obtain the data in order to perform the analysis. The data must be chosen with the basic purpose of building the predictive model, and so data selection is crucial for the success of the analysis as well. The sample data collected must reflect as much as possible the real world, that is, how the system responds to stimuli from the real world. For example, if you're using huge datasets of raw data and they are not collected competently, these may portray false or unbalanced situations. Thus, poor choice of data, or even performing analysis on a dataset that's not perfectly representative of the system, will lead to models that will move away from the system under study.

The search and retrieval of data often require a form of intuition that goes beyond mere technical research and data extraction. This process also requires a careful understanding of the nature and form of the data, which only good experience and knowledge in the problem's application field can provide. Regardless of the quality and quantity of data needed, another issue is using the best data sources. If the study environment is a laboratory (technical or scientific) and the data generated are experimental, then in this case the data source is easily identifiable. In this case, the problems will only concern the experimental setup.

But it is not possible for data analysis to reproduce systems in which data are gathered in a strictly experimental way in every field of application. Many fields require searching for data from the surrounding world, often relying on external experimental data, or even more often collecting them through interviews or surveys. So in these cases, finding a good data source that is able to provide all the information you need for data analysis can be quite challenging. Often it is necessary to retrieve data from multiple data sources to supplement any shortcomings, to identify any discrepancies, and to make the dataset as general as possible.

When you want to get the data, a good place to start is the Web. But most of the data on the Web can be difficult to capture; in fact, not all data are available in a file or database, but might be content that is inside HTML pages in many different formats. To this end, a methodology called web scraping allows the collection of data through the recognition of specific occurrences of HTML tags within web pages. There is software specifically designed for this purpose, and once an occurrence is found, it extracts the desired data. Once the search is complete, you will get a list of data ready to be subjected to data analysis.
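As a hedged sketch of this idea, pandas itself (covered in Chapter 5) offers a shortcut for the simplest case, where the desired data sit inside HTML table tags; the URL below is only a placeholder, and an HTML parser such as lxml must be installed:

import pandas as pd

# read_html() scans the page and returns a list of DataFrames,
# one for each HTML table it recognizes in the page.
tables = pd.read_html('http://example.com/population.html')
first_table = tables[0]
print(first_table.head())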
Data Preparation

Among all the steps involved in data analysis, data preparation, although seemingly less problematic, in fact requires more resources and more time to be completed. Data are often collected from different data sources, each of which will have data in it with a different representation and format. So, all of these data will have to be prepared for the process of data analysis.

The preparation of the data is concerned with obtaining, cleaning, normalizing, and transforming data into an optimized dataset, that is, in a prepared format that's normally tabular and is suitable for the methods of analysis that have been scheduled during the design phase. Many potential problems can arise, including invalid, ambiguous, or missing values, replicated fields, and out-of-range data.
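To make these kinds of problems concrete, here is a minimal cleaning sketch with pandas; the dataset and the valid range are invented for illustration:

import numpy as np
import pandas as pd

# A tiny invented dataset with a duplicate, a missing value,
# and an out-of-range reading.
raw = pd.DataFrame({'sensor': ['s1', 's1', 's2', 's2', 's2'],
                    'value': [0.97, 0.97, np.nan, 1.02, 250.0]})

clean = raw.drop_duplicates()          # remove replicated records
clean = clean.dropna()                 # drop rows with missing values
clean = clean[clean['value'] < 100.0]  # filter out-of-range readings
print(clean)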
Data Exploration/Visualization

Exploring the data involves essentially searching the data in a graphical or statistical presentation in order to find patterns, connections, and relationships. Data visualization is the best tool to highlight possible patterns. In recent years, data visualization has been developed to such an extent that it has become a real discipline in itself. In fact, numerous technologies are utilized exclusively to display data, and many display types are applied to extract the best possible information from a dataset.

Data exploration consists of a preliminary examination of the data, which is important for understanding the type of information that has been collected and what it means. In combination with the information acquired during the problem definition, this categorization will determine which method of data analysis will be most suitable for arriving at a model definition. Generally, this phase, in addition to a detailed study of charts through the visualization data, may consist of one or more of the following activities:

- Summarizing data
- Grouping data
- Exploring the relationship between the various attributes
- Identifying patterns and trends
- Constructing regression models
- Constructing classification models

Generally, data analysis requires summarizing statements regarding the data to be studied. Summarization is a process by which data are reduced to interpretation without sacrificing important information. Clustering is a method of data analysis that is used to find groups united by common attributes (also called grouping).

Another important step of the analysis focuses on the identification of relationships, trends, and anomalies in the data. In order to find this kind of information, you often have to resort to other tools, as well as perform another round of data analysis, this time on the data visualization itself. Other methods of data mining, such as decision trees and association rules, automatically extract important facts or rules from the data. These approaches can be used in parallel with data visualization to uncover relationships between the data.
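In pandas terms, several of the activities just listed reduce to one-line operations. A minimal sketch on an invented dataset:

import pandas as pd

df = pd.DataFrame({'city': ['Rome', 'Rome', 'Milan', 'Milan'],
                   'temp': [20.1, 22.3, 18.2, 19.0],
                   'humidity': [55, 48, 63, 60]})

print(df.describe())                     # summarizing data
print(df.groupby('city').mean())         # grouping data
print(df['temp'].corr(df['humidity']))   # relationship between attributes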
Predictive Modeling

Predictive modeling is a process used in data analysis to create or choose a suitable statistical model to predict the probability of a result. After exploring the data, you have all the information needed to develop the mathematical model that encodes the relationship between the data. These models are useful for understanding the system under study, and in a specific way they are used for two main purposes. The first is to make predictions about the data values produced by the system; in this case, you will be dealing with regression models. The second purpose is to classify new data products, and in this case, you will be using classification models or clustering models. In fact, it is possible to divide the models according to the type of result they produce:

- Classification models: If the result obtained by the model type is categorical.
- Regression models: If the result obtained by the model type is numeric.
- Clustering models: If the result obtained by the model type is descriptive.

Simple methods to generate these models include techniques such as linear regression, logistic regression, classification and regression trees, and k-nearest neighbors. But the methods of analysis are numerous, and each has specific characteristics that make it excellent for some types of data and analysis. Each of these methods will produce a specific model, and then their choice is relevant to the nature of the product model.

Some of these models will provide values corresponding to the real system and according to their structure. They will explain some characteristics of the system under study in a simple and clear way. Other models will continue to give good predictions, but their structure will be no more than a "black box" with limited ability to explain characteristics of the system.

Model Validation

Validation of the model, that is, the test phase, is an important phase that allows you to validate the model built on the basis of starting data. That is important because it allows you to assess the validity of the data produced by the model by comparing them directly with the actual system. But this time, you are coming out from the set of starting data on which the entire analysis has been established. Generally, you will refer to the data as the training set when you are using them for building the model, and as the validation set when you are using them for validating the model.

Thus, by comparing the data produced by the model with those produced by the system, you will be able to evaluate the error, and using different test datasets, you can estimate the limits of validity of the generated model. In fact the correctly predicted values could be valid only within a certain range, or have different levels of matching depending on the range of values taken into account. This process allows you not only to numerically evaluate the effectiveness of the model but also to compare it with any other existing models. There are several techniques in this regard; the most famous is cross-validation. This technique is based on the division of the training set into different parts. Each of these parts, in turn, will be used as the validation set and the others as the training set. In this iterative manner, you will have an increasingly perfected model.
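As a sketch of cross-validation with scikit-learn (the library introduced in Chapter 8), using one of its bundled datasets; the model and the number of parts (five folds) are arbitrary choices here:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
model = KNeighborsClassifier()

# Split the data into 5 parts; each part is used once as the
# validation set while the remaining parts form the training set.
scores = cross_val_score(model, iris.data, iris.target, cv=5)
print(scores, scores.mean())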
Deployment

This is the final step of the analysis process, which aims to present the results, that is, the conclusions of the analysis. In the deployment process in the business environment, the analysis is translated into a benefit for the client who has commissioned it. In technical or scientific environments, it is translated into design solutions or scientific publications. That is, the deployment basically consists of putting into practice the results obtained from the data analysis.

There are several ways to deploy the results of data analysis or data mining. Normally, a data analyst's deployment consists in writing a report for management or for the customer who requested the analysis. This document will conceptually describe the results obtained from the analysis of data. The report should be directed to the managers, who are then able to make decisions. Then, they will put into practice the conclusions of the analysis. In the documentation supplied by the analyst, each of these four topics will be discussed in detail:

- Analysis results
- Decision deployment
- Risk analysis
- Measuring the business impact

When the results of the project include the generation of predictive models, these models can be deployed as stand-alone applications or can be integrated into other software.

Quantitative and Qualitative Data Analysis

Data analysis is completely focused on data. Depending on the nature of the data, it is possible to make some distinctions. When the analyzed data have a strictly numerical or categorical structure, then you are talking about quantitative analysis, but when you are dealing with values that are expressed through descriptions in natural language, then you are talking about qualitative analysis.

Precisely because of the different nature of the data processed by the two types of analyses, you can observe some differences between them. Quantitative analysis has to do with data with a logical order or that can be categorized in some way. This leads to the formation of structures within the data. The order, categorization, and structures in turn provide more information and allow further processing of the data in a more mathematical way. This leads to the generation of models that provide quantitative predictions, thus allowing the data analyst to draw more objective conclusions.

Qualitative analysis instead has to do with data that generally do not have a structure, at least not one that is evident, and their nature is neither numeric nor categorical. For example, data under qualitative study could include written textual, visual, or audio data. This type of analysis must therefore be based on methodologies, often ad hoc, to extract information that will generally lead to models capable of providing qualitative predictions, with the result that the conclusions to which the data analyst can arrive may also include subjective interpretations. On the other hand, qualitative analysis can explore more complex systems and draw conclusions that are not possible using a strictly mathematical approach. Often this type of analysis involves the study of systems such as social phenomena or complex structures that are not easily measurable. Figure 1-2 shows the differences between the two types of analysis.

[Figure 1-2. Quantitative and qualitative analyses]

Open Data

In support of the growing demand for data, a huge number of data sources are now available on the Internet. These data sources freely provide information to anyone in need, and they are called open data. Here is a list of some open data available online. You can find a more complete list and details of the open data available online in Appendix B.

- DataHub (http://datahub.io/dataset)
- World Health Organization (http://www.who.int/research/en/)
- Data.gov (http://data.gov)
- European Union Open Data Portal (http://open-data.europa.eu/en/data/)
- Amazon Web Service public datasets (http://aws.amazon.com/datasets)
- Facebook Graph (http://developers.facebook.com/docs/graph-api)
- Healthdata.gov (http://www.healthdata.gov)
- Google Trends (http://www.google.com/trends/explore)
- Google Finance (https://www.google.com/finance)
- Google Books Ngrams (http://storage.googleapis.com/books/ngrams/books/datasetsv2.html)
- Machine Learning Repository (http://archive.ics.uci.edu/ml/)

As an idea of open data sources available online, you can look at the LOD cloud diagram (http://lod-cloud.net), which displays the connections of the data link among several open data sources currently available on the network (see Figure 1-3).

[Figure 1-3. Linking open data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch, and Richard Cyganiak. http://lod-cloud.net/ (CC-BY-SA license)]
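Many of these portals expose datasets as plain CSV endpoints, which pandas can read directly from a URL. In this minimal sketch the URL is only a placeholder for whatever dataset you choose:

import pandas as pd

# Any open data portal URL that returns CSV content will do here.
url = 'http://example.com/open-data/population.csv'
dataset = pd.read_csv(url)
print(dataset.head())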
Python and Data Analysis

The main argument of this book is to develop all the concepts of data analysis by treating them in terms of Python. The Python programming language is widely used in scientific circles because of its large number of libraries that provide a complete set of tools for analysis and data manipulation. Compared to other programming languages generally used for data analysis, such as R and MATLAB, Python not only provides a platform for processing data, but also has features that make it unique compared to other languages and specialized applications. The development of an ever-increasing number of support libraries, the implementation of algorithms of more innovative methodologies, and the ability to interface with other programming languages (C and Fortran) all make Python unique among its kind.

Furthermore, Python is not only specialized for data analysis, but also has many other applications, such as generic programming, scripting, interfacing to databases, and more recently web development, thanks to web frameworks like Django. So it is possible to develop data analysis projects that are compatible with the web server, with the possibility to integrate them on the Web. So, for those who want to perform data analysis, Python, with all its packages, is considered the best choice for the foreseeable future.

Conclusions

In this chapter, you learned what data analysis is and, more specifically, the various processes that comprise it. Also, you have begun to see the role that data play in building a prediction model and how their careful selection is at the basis of a careful and accurate data analysis. In the next chapter, you will take this vision of Python and the tools it provides to perform data analysis.

CHAPTER 2
Introduction to the Python World

The Python language, and the world around it, is made up of interpreters, tools, editors, libraries, notebooks, etc. This Python world has expanded greatly in recent years, enriching and taking forms that developers who approach it for the first time can sometimes find complicated and somewhat misleading. Thus if you are approaching Python for the first time, you might feel lost among so many choices, especially on where to start.

This chapter gives you an overview of the entire Python world. First you will read a description of the Python language and its unique characteristics. You'll see where to start, what an interpreter is, and how to begin writing the first lines of code in Python. Then you are presented with some new, more advanced, forms of interactive writing with respect to the shells, such as IPython and IPython Notebook.

Python—The Programming Language

The Python programming language was created by Guido van Rossum in 1991 and started with a previous language called ABC. This language can be characterized by a series of adjectives:

- Interpreted
- Portable
- Object-oriented
- Interactive
- Interfaced
- Open source
- Easy to understand and use

Python is an interpreted programming language, that is, it's pseudo-compiled. Once you write the code, you need an interpreter to run it. The interpreter is a program that is installed on each machine that has the task of interpreting the source code and running it. Unlike with languages such as C, C++, and Java, there is no compile time with Python.

Python is a highly portable programming language. The decision to use an interpreter as an interface for reading and running code has a key advantage: portability.
In fact, you can install an interpreter on any platform (Linux, Windows, and Mac) and the Python code will not change. Because of this, Python is often used as the programming language for many small-form devices, such as the Raspberry Pi and other microcontrollers.

Python is an object-oriented programming language. In fact, it allows you to specify classes of objects and implement their inheritance. But unlike C++ and Java, there are no constructors or destructors. Python also allows you to implement specific constructs in your code to manage exceptions. However, the structure of the language is so flexible that it allows you to program with alternative approaches with respect to the object-oriented one. For example, you can use functional or vectorial approaches.

Python is an interactive programming language. Thanks to the fact that Python uses an interpreter to be executed, this language can take on very different aspects depending on the context in which it is used. In fact, you can write code made of a lot of lines, similar to what you might do in languages like C++ or Java, and then launch the program, or you can enter the command line at once and execute it, immediately getting the results of the command. Then, depending on the results, you can decide what command to run next. This highly interactive way to execute code makes the Python computing environment similar to MATLAB. This feature of Python is one reason it's popular with the scientific community.

Python is a programming language that can be interfaced. In fact, this programming language can be interfaced with code written in other programming languages such as C/C++ and FORTRAN. Even this was a winning choice. In fact, thanks to this aspect, Python can compensate for what is perhaps its only weak point, the speed of execution. The nature of Python, as a highly dynamic programming language, can sometimes lead to execution of programs up to 100 times slower than the corresponding static programs compiled with other languages. Thus the solution to this kind of performance problem is to interface Python to the compiled code of other languages by using it as if it were its own.
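The standard library offers a first taste of this interfacing through the ctypes module, which loads compiled C libraries at runtime. This is a minimal sketch; the way the C math library is located varies by platform and is an assumption here:

import ctypes
import ctypes.util

# Locate and load the C math library (the name varies by platform;
# on Linux this typically resolves to libm.so.6).
libm = ctypes.CDLL(ctypes.util.find_library('m'))
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]
print(libm.sqrt(2.0))  # calls the compiled C sqrt, not a Python function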
Python is an open-source programming language. CPython, which is the reference implementation of the Python language, is completely free and open source. Additionally every module or library in the network is open source and their code is available online. Every month, an extensive developer community includes improvements to make this language and all its libraries even richer and more efficient. CPython is managed by the nonprofit Python Software Foundation, which was created in 2001 and has given itself the task of promoting, protecting, and advancing the Python programming language.

Finally, Python is a simple language to use and learn. This aspect is perhaps the most important, because it is the most direct aspect that a developer, even a novice, faces. The high intuitiveness and ease of reading of Python code often leads to "sympathy" for this programming language, and consequently it is the choice of most newcomers to programming. However, its simplicity does not mean narrowness, since Python is a language that is spreading in every field of computing. Furthermore, Python is doing all of this so simply, in comparison to existing programming languages such as C++, Java, and FORTRAN, which by their nature are very complex.

Python—The Interpreter

As described in the previous sections, each time you run the python command, the Python interpreter starts, characterized by a >>> prompt. The Python interpreter is simply a program that reads and interprets the commands passed to the prompt. You have seen that the interpreter can accept either a single command at a time or entire files of Python code. However the approach by which it performs this is always the same.

Each time you press the Enter key, the interpreter begins to scan the code (either a row or a full file of code) token by token (called tokenization). These tokens are fragments of text that the interpreter arranges in a tree structure. The tree obtained is the logical structure of the program, which is then converted to bytecode (.pyc or .pyo). The process chain ends with the bytecode that will be executed by a Python virtual machine (PVM). See Figure 2-1.

[Figure 2-1. The steps performed by the Python interpreter]

You can find very good documentation on this process at https://www.ics.uci.edu/~pattis/ICS-31/lectures/tokens.pdf.

The standard Python interpreter is known as CPython, since it was written in C. There are other implementations that have been developed using other programming languages, such as Jython, developed in Java; IronPython, developed in C# (only for Windows); and PyPy, developed entirely in Python.

Cython

The Cython project is based on creating a compiler that translates Python code into C. This code is then executed within a Cython environment at runtime. This type of compilation system has made it possible to introduce C semantics into the Python code to make it even more efficient. This system has led to the merging of two worlds of programming language with the birth of Cython, which can be considered a new programming language. You can find a lot of documentation about it online; I advise you to visit http://docs.cython.org.

Jython

In parallel to Cython, there is a version totally built and compiled in Java, named Jython. It was created by Jim Hugunin in 1997 (http://www.jython.org). Jython is an implementation of the Python programming language in Java; it is further characterized by using Java classes instead of Python modules to implement extensions and packages of Python.

PyPy

The PyPy interpreter is a JIT (just-in-time) compiler, and it converts the Python code directly into machine code at runtime. This choice was made to speed up the execution of Python. However, this choice has led to the use of a smaller subset of Python commands, defined as RPython. For more information on this, consult the official website at http://pypy.org.
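You can inspect part of this pipeline from Python itself, since the standard library exposes the tokenizer, the tree builder, and the bytecode disassembler. A minimal sketch:

import io
import tokenize
import ast
import dis

source = "x = 1 + 2"

# Tokenization: the source is scanned token by token
for token in tokenize.generate_tokens(io.StringIO(source).readline):
    print(token.string, end=' | ')
print()

# The tokens are arranged in a tree structure (the abstract syntax tree)
print(ast.dump(ast.parse(source)))

# The tree is converted to bytecode, which the Python virtual machine executes
dis.dis(compile(source, '<example>', 'exec'))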
Python 2 and Python 3

The Python community is still in transition from interpreters of the Series 2 to Series 3. In fact, you will currently find two releases of Python that are used in parallel (version 2.7 and version 3.6). This kind of ambiguity can create confusion, especially in terms of choosing which version to use and the differences between these two versions. One question that you surely must be asking is why version 2.x is still being released if it is distributed around a much more enhanced version such as 3.x.

When Guido van Rossum (the creator of Python) decided to bring significant changes to the Python language, he soon found that these changes would make the new version incompatible with a lot of existing code. Thus he decided to start with a new version of Python called Python 3.0. To overcome the problem of incompatibility and avoid creating huge amounts of unusable code, it was decided to maintain a compatible version, 2.7 to be precise.

Python 3.0 made its first appearance in 2008, while version 2.7 was released in 2010 with a promise that it would not be followed by big releases, and at the moment the current version is 3.6.5 (2018). In the book we refer to the Python 3.x version; however, with a few exceptions, there should be no problem with the Python 2.7.x version (the last version is 2.7.14 and was released in September 2017).
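Two of the best-known incompatibilities can be shown in a few lines; this snippet runs under Python 3, and the comments describe the Python 2 behavior:

# In Python 3, print is a built-in function and the / operator
# between integers returns a float; Python 2 behaved differently.
print("Hello")   # Python 2 also accepted: print "Hello"
print(5 / 2)     # 2.5 in Python 3, but 2 in Python 2
print(5 // 2)    # 2 in both versions: explicit floor division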
The management of the entire Anaconda distribution is performed by an application called conda. This is the package manager and the environment manager of the Anaconda distribution, and it handles all of the packages and their versions. For example, you install a package with:

conda install <package name>

One of the most interesting aspects of this distribution is the ability to manage multiple development environments, each with its own version of Python. Indeed, when you install Anaconda, Python version 2.7 is installed by default. All installed packages then will refer to that version. This is not a problem, because Anaconda offers the possibility to work simultaneously and independently with other Python versions by creating a new environment. You can create, for instance, an environment based on Python 3.6.

conda create -n py36 python=3.6 anaconda

This will generate a new Anaconda environment with all the packages related to the Python 3.6 version. This installation will not affect in any way the environment built with Python 2.7. Once it's installed, you can activate the new environment by entering the following command.

source activate py36

On Windows, use this instead:

C:\Users\Fabio>activate py36
(py36) C:\Users\Fabio>

You can create as many environments as you want; you need only change the parameter passed with the python option in the conda create command. When you want to return to working with the original Python version, use the following command:

source deactivate

On Windows, use this:

(py36) C:\Users\Fabio>deactivate
Deactivating environment "py36"...
C:\Users\Fabio>

Enthought Canopy

There is another distribution very similar to Anaconda: the Canopy distribution provided by Enthought, a company founded in 2001 and known for the SciPy project (https://www.enthought.com/products/canopy/). This distribution supports Linux, Windows, and MacOS X systems and consists of a large number of packages, tools, and applications managed by a package manager. The package manager of Canopy, as opposed to conda, is graphical. Unfortunately, only the basic version of this distribution, called Canopy Express, is free. In addition to the packages normally distributed, it includes IPython and the Canopy IDE, which has a special feature not present in other IDEs: it embeds IPython so that the environment can be used as a window for testing and debugging code.

Python(x,y)

Python(x,y) is a free distribution that works only on Windows and is downloadable from http://code.google.com/p/pythonxy/. This distribution uses Spyder as the IDE.

Using Python

Python is rich but simple and very flexible. It allows you to expand your development activities into many areas of work (data analysis, scientific computing, graphical interfaces, etc.). Precisely for this reason, Python can be used in many different contexts, often according to the taste and ability of the developer. This section presents the various approaches to using Python in the course of the book. According to the various topics discussed in different chapters, these different approaches will be used specifically, as they will be more suited to the task at hand.

Python Shell

The easiest way to approach the Python world is to open a session in the Python shell, a command-line terminal in which you enter one command at a time and test its operation immediately.
This mode makes clear the nature of the interpreter that underlies Python: the interpreter reads one command at a time, keeping the status of the variables specified in the previous lines, a behavior similar to that of MATLAB and other calculation software.

This approach is helpful when approaching Python for the first time. You can test commands one at a time without having to write, edit, and run an entire program, which could be composed of many lines of code. This mode is also good for testing and debugging Python code one line at a time, or simply for making calculations.

To start a session on the terminal, simply type this on the command line:

> python
Python 3.6.3 (default, Oct 15 2017, 03:27:45) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>>

Now the Python shell is active and the interpreter is ready to receive commands in Python. Start by entering the simplest of commands, a classic for getting started with programming.

>>> print("Hello World!")
Hello World!

Run an Entire Program

The best way to become familiar with Python is to write an entire program and then run it from the terminal. First write a program using a simple text editor. For example, you can use the code shown in Listing 2-1 and save it as MyFirstProgram.py.

Listing 2-1. MyFirstProgram.py

myname = input("What is your name? ")
print("Hi " + myname + ", I'm glad to say: Hello world!")

Now you've written your first program in Python, and you can run it directly from the command line by calling the python command and then the name of the file containing the program code.

python MyFirstProgram.py
What is your name? Fabio Nelli
Hi Fabio Nelli, I'm glad to say: Hello world!

Implement the Code Using an IDE

A more comprehensive approach than the previous ones is the use of an IDE (an Integrated Development Environment). These editors provide a work environment in which to develop your Python code. They are rich in tools that make developers' lives easier, especially when debugging. In the following sections, you will see in detail which IDEs are currently available.

Interact with Python

The last approach, and in my opinion perhaps the most innovative, is the interactive one. In addition to the three previous approaches, it provides you the opportunity to interact directly with the Python code. In this regard, the Python world has been greatly enriched with the introduction of IPython. IPython is a very powerful tool, designed specifically to meet the needs of interaction between the Python interpreter and the developer, who under this approach takes on the role of analyst, engineer, or researcher. IPython and its features are explained in more detail in a later section.

Writing Python Code

In the previous section you saw how to write a simple program in which the string "Hello World!" was printed. Now in this section you will get a brief overview of the basics of the Python language. This section is not intended to teach you to program in Python, or to illustrate the syntax rules of the language, but just to give you a quick overview of some basic principles of Python necessary to continue with the topics covered in this book.

If you already know the Python language, you can safely skip this introductory section.
If instead you are not familiar with programming and find it difficult to understand these topics, I highly recommend that you visit online documentation, tutorials, and courses of various kinds.

Make Calculations

You have already seen that the print() function is useful for printing almost anything. Python, in addition to being a printing tool, is also a great calculator. Start a session on the Python shell and begin to perform these mathematical operations:

>>> 1 + 2
3
>>> (1.045 * 3)/4
0.78375
>>> 4 ** 2
16
>>> ((4 + 5j) * (2 + 3j))
(-7+22j)
>>> 4 < (2*3)
True

Python can work with many types of data, including complex numbers and conditions with Boolean values. As you can see from these calculations, the Python interpreter directly returns the result of the calculations without the need to use the print() function. The same thing applies to values contained in variables. It's enough to call the variable to see its contents.

>>> a = 12 * 3.4
>>> a
40.8

Import New Libraries and Functions

You saw that Python is characterized by the ability to extend its functionality by importing numerous packages and modules. To import a module in its entirety, you have to use the import command.

>>> import math

In this way all the functions contained in the math package are available in your Python session, so you can call them directly. Thus you have extended the standard set of functions available when you start a Python session. These functions are called with the following expression.

library_name.function_name()

For example, you can now calculate the sine of the value contained in the variable a.

>>> math.sin(a)
0.040693257349864856

As you can see, the function is called along with the name of the library. Sometimes you might find the following expression for declaring an import.

>>> from math import *

Even though this works properly, it should be avoided as a matter of good practice. In fact, writing an import in this way imports all the functions without specifying the library to which they belong.

>>> sin(a)
0.040693257349864856

This form of import can lead to serious errors, especially if the imported libraries are numerous. It is not unlikely that different libraries have functions with the same name, and importing all of them overrides any previously imported functions with the same name. Therefore the program could generate numerous errors or, worse, behave abnormally.

In practice, this way of importing is generally used for only a limited number of functions, that is, functions that are strictly necessary for the functioning of the program, thus avoiding the importation of an entire library when it is completely unnecessary.

>>> from math import sin

Data Structure

You saw in the previous examples how to use simple variables containing a single value. Python provides a number of extremely useful data structures. These data structures are able to contain lots of data simultaneously, and sometimes even data of different types. The various data structures provided are defined differently depending on how their data are structured internally.

List
Set
Strings
Tuples
Dictionary
Deque
Heap

This is only a small part of all the data structures that can be made with Python. Among all these data structures, the most commonly used are dictionaries and lists.
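Before focusing on those two, here is a quick taste of some of the others. This is only a minimal sketch; note that the deque type lives in the standard library's collections module.

>>> t = (1, 2, 3)          # tuple: an immutable sequence
>>> t[0]
1
>>> s = {1, 2, 2, 3}       # set: unordered, duplicates are removed
>>> s
{1, 2, 3}
>>> from collections import deque
>>> d = deque([1, 2, 3])   # deque: fast appends and pops at both ends
>>> d.appendleft(0)
>>> d
deque([0, 1, 2, 3])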
The dictionary type, also defined as dict, is a data structure in which each particular value is associated with a particular label, called a key. The data collected in a dictionary have no internal order but are only definitions of key/value pairs.

>>> dict = {'name':'William', 'age':25, 'city':'London'}

If you want to access a specific value within the dictionary, you have to indicate the name of the associated key.

>>> dict["name"]
'William'

If you want to iterate the pairs of values in a dictionary, you have to use the for-in construct. This is possible through the use of the items() function.

>>> for key, value in dict.items():
...    print(key, value)
...
name William
age 25
city London

The list type is a data structure that contains a number of objects in a precise order to form a sequence to which elements can be added and removed. Each item is marked with a number corresponding to the order of the sequence, called the index.

>>> list = [1,2,3,4]
>>> list
[1, 2, 3, 4]

If you want to access the individual elements, it is sufficient to specify the index in square brackets (the first item in the list has 0 as its index), while to take out a portion of the list (or a sequence), it is sufficient to specify the range with the indices i and j corresponding to the extremes of the portion.

>>> list[2]
3
>>> list[1:3]
[2, 3]

If you use negative indices instead, you are counting from the last item in the list and gradually moving toward the first.

>>> list[-1]
4

In order to scan the elements of a list, you can use the for-in construct.

>>> items = [1,2,3,4,5]
>>> for item in items:
...        print(item + 1)
...
2
3
4
5
6

Functional Programming

The for-in loop shown in the previous example is very similar to the loops found in other programming languages. But actually, if you want to be a "Python" developer, you have to avoid using explicit loops. Python offers alternative approaches, based on programming techniques such as functional programming (expression-oriented programming). The tools that Python provides to develop functional programming comprise a series of functions:

map(function, list)
filter(function, list)
reduce(function, list)
lambda
list comprehension

The for loop that you have just seen has a specific purpose, which is to apply an operation to each item and then somehow gather the result. This can be done by the map() function.

>>> items = [1,2,3,4,5]
>>> def inc(x): return x+1
...
>>> list(map(inc,items))
[2, 3, 4, 5, 6]

This example first defines the function that performs the operation on every single element, and then passes it as the first argument to map(). Python allows you to define the function directly within the first argument using lambda. This greatly reduces the code and compacts the previous construct into a single line of code.

>>> list(map((lambda x: x+1),items))
[2, 3, 4, 5, 6]

Two other functions working in a similar way are filter() and reduce(). The filter() function extracts the elements of the list for which the function returns True. The reduce() function instead considers all the elements of the list to produce a single result. To use reduce(), you must import the module functools.

>>> list(filter((lambda x: x < 4), items))
[1, 2, 3]
>>> from functools import reduce
>>> reduce((lambda x,y: x/y), items)
0.008333333333333333
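To make the comparison concrete, here is the explicit for loop that the reduce() call above replaces. This is a sketch for illustration only; it computes the same chained division.

>>> result = items[0]          # start from the first element
>>> for item in items[1:]:     # fold the rest in, one at a time
...     result = result / item
...
>>> result
0.008333333333333333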
Both of these functions implement behavior that would otherwise be written with explicit for loops. They replace those loops and their functionality, expressing the same operations as simple function calls. That is what constitutes functional programming.

The final concept of functional programming is list comprehension. This concept is used to build lists in a very natural and simple way, referring to them in a manner similar to how mathematicians describe datasets. The values in the sequence are defined through a particular function or operation.

>>> S = [x**2 for x in range(5)]
>>> S
[0, 1, 4, 9, 16]

Indentation

A peculiarity for those coming from other programming languages is the role that indentation plays. Whereas in those languages you manage indentation for purely aesthetic reasons, making the code somewhat more readable, in Python indentation assumes an integral role in the implementation of the code, by dividing it into logical blocks. In fact, while in Java, C, and C++, each line of code is separated from the next by a semicolon (;), in Python you do not specify any symbol to separate lines, nor do you use braces to indicate a logical block. These roles are handled in Python through indentation; that is, depending on the starting point of the code line, the interpreter determines whether it belongs to a logical block or not.

>>> a = 4
>>> if a > 3:
...    if a < 5:
...        print("I'm four")
... else:
...    print("I'm a little number")
...
I'm four

>>> if a > 3:
...    if a < 5:
...       print("I'm four")
...    else:
...       print("I'm a big number")
...
I'm four

In this example you can see that depending on how the else command is indented, the conditions assume two different meanings (specified by me in the strings themselves).

IPython

IPython is a further development of Python that includes a number of tools:

The IPython shell, which is a powerful interactive shell resulting in a greatly enhanced Python terminal.

A QtConsole, which is a hybrid between a shell and a GUI, allowing you to display graphics inside the console instead of in separate windows.

The IPython Notebook, which is a web interface that allows you to mix text, executable code, graphics, and formulas in a single representation.

IPython Shell

This shell apparently resembles a Python session run from a command line, but actually, it provides many other features that make it much more powerful and versatile than the classic one. To launch this shell, just type ipython on the command line.

> ipython
Python 3.6.3 (default, Oct 15 2017, 3:27:45) [MSC v.1900 64bit (AMD64)]
Type "copyright", "credits", or "license" for more information.
IPython 6.1.0 -- An enhanced Interactive Python. Type '?' for help

In [1]:

As you can see, a particular prompt appears with the value In [1]. This means that it is the first line of input. Indeed, IPython offers a system of numbered (indexed) prompts with input and output caching.

In [1]: print("Hello World!")
Hello World!

In [2]: 3/2
Out[2]: 1.5

In [3]: 5.0/2
Out[3]: 2.5

In [4]:

The same thing applies to values in output, which are indicated with the values Out[2], Out[3], and so on. IPython saves all the inputs that you enter by storing them as variables. In fact, all the inputs entered are included as fields in a list called In.
In [4]: In
Out[4]: ['', 'print("Hello World!")', '3/2', '5.0/2', 'In']

The indices of the list elements are the values that appear in each prompt. Thus, to access a single line of input, you can simply specify that value.

In [5]: In[3]
Out[5]: '5.0/2'

For output, you can apply the same concept.

In [6]: Out
Out[6]:
{2: 1.5,
 3: 2.5,
 4: ['', 'print("Hello World!")', '3/2', '5.0/2', 'In', 'In[3]', 'Out'],
 5: '5.0/2'}

The Jupyter Project

IPython is a project that has grown enormously in recent times, and with the release of IPython 3.0, everything is moving toward a new project called Jupyter (https://jupyter.org)—see Figure 2-2.

Figure 2-2. The Jupyter project logo

IPython will continue to exist as a Python shell and as a kernel of Jupyter, but the Notebook and the other language-agnostic components belonging to the IPython project will move to form the new Jupyter project.

Jupyter QtConsole

In order to launch this application from the command line, you must enter the following command:

ipython qtconsole

or

jupyter qtconsole

The application consists of a GUI that has all the functionality present in the IPython shell. See Figure 2-3.

Figure 2-3. The IPython QtConsole

Jupyter Notebook

Jupyter Notebook is the latest evolution of this interactive environment (see Figure 2-4). In fact, with Jupyter Notebook, you can merge executable code, text, formulas, images, and animations into a single Web document. This is useful for many purposes such as presentations, tutorials, debugging, and so forth.

Figure 2-4. The web page showing the Jupyter Notebook

PyPI—The Python Package Index

The Python Package Index (PyPI) is a software repository that contains much of the software needed for programming in Python, in particular the Python packages belonging to third-party libraries. The content of the repository is managed directly by the developers of the individual packages, who update it with the latest versions of their released libraries. For a list of the packages contained in the repository, go to the official page of PyPI at https://pypi.python.org/pypi.

As far as the administration of these packages goes, you can use the pip application, which is the package manager of PyPI. By launching it from the command line, you can manage all the packages and individually decide whether a package should be installed, upgraded, or removed. Pip checks whether the package is already installed or needs to be updated, resolves dependencies, and assesses whether other packages are necessary. Furthermore, it manages the downloading and installation processes.

$ pip install <package_name>
$ pip search <package_name>
$ pip show <package_name>
$ pip uninstall <package_name>

Regarding the installation, if you have Python 3.4+ (released March 2014) or Python 2.7.9+ (released December 2014) already installed on your system, the pip software is already included in these releases of Python. However, if you are still using an older version of Python, you need to install pip on your system. The installation of pip depends on the operating system on which you are working.

On Linux Debian-Ubuntu, use this command:

$ sudo apt-get install python-pip

On Linux Fedora, use this command:

$ sudo yum install python-pip

On Windows, visit https://pip.pypa.io/en/latest/installing/ and download get-pip.py onto your PC.
Once the file is downloaded, run this command:

python get-pip.py

This way, you will install the package manager. Remember to add C:\Python3.X\Scripts to the PATH environment variable.

The IDEs for Python

Although most Python developers are used to implementing their code directly from the shell (Python or IPython), some IDEs (Integrated Development Environments) are also available. In fact, in addition to a text editor, these graphical editors provide a series of tools that are very useful during the drafting of the code. For example, auto-completion of code, viewing the documentation associated with the commands, debugging, and breakpoints are only some of the tools that this kind of application can provide.

Spyder

Spyder (Scientific Python Development Environment) is an IDE that has features similar to the MATLAB IDE (see Figure 2-5). The text editor is enriched with syntax highlighting and code analysis tools. Also, you can integrate ready-to-use widgets in your graphic applications.

Figure 2-5. The Spyder IDE

Eclipse (pyDev)

Those of you who have developed in other programming languages certainly know Eclipse, a universal IDE developed entirely in Java (therefore requiring Java installation on your PC) that provides a development environment for many programming languages (see Figure 2-6). There is also an Eclipse version for developing in Python, thanks to the installation of an additional plugin called pyDev.

Figure 2-6. The Eclipse IDE

Sublime

This text editor is one of the preferred environments for Python programmers (see Figure 2-7). In fact, there are several plugins available for this application that make Python implementation easy and enjoyable.

Figure 2-7. The Sublime IDE

Liclipse

Liclipse, similarly to Spyder, is a development environment specifically designed for the Python language (see Figure 2-8). It is very similar to the Eclipse IDE but is fully adapted to Python, without needing to install plugins like pyDev. So its installation and setup are much simpler than Eclipse's.

Figure 2-8. The Liclipse IDE

NinjaIDE

NinjaIDE, whose name is a recursive acronym ("NinjaIDE Is Not Just Another IDE"), is a specialized IDE for the Python language (see Figure 2-9). It's a very recent application on which the efforts of many developers are focused. Already very promising, it is likely that in the coming years, this IDE will be a source of many surprises.

Figure 2-9. The Ninja IDE

Komodo IDE

Komodo is a very powerful IDE full of tools that make it a complete and professional development environment (see Figure 2-10). It is paid software written in C++, and the Komodo development environment is adaptable to many programming languages, including Python.

Figure 2-10. The Komodo IDE

SciPy

SciPy (pronounced "sigh pie") is a set of open-source Python libraries specialized for scientific computing. Many of these libraries are the protagonists of many chapters of this book, given that knowledge of them is critical to data analysis. Together they constitute a set of tools for calculating and displaying data that has little to envy in other specialized environments for calculation and data analysis (such as R or MATLAB).
Among the libraries that are part of the SciPy group, there are three in particular that are discussed in the following chapters:

NumPy
matplotlib
Pandas

NumPy

This library, whose name means numerical Python, constitutes the core of many other Python libraries that have originated from it. Indeed, NumPy is the foundation library for scientific computing in Python, since it provides data structures and high-performing functions that the basic package of Python cannot provide. In fact, as you will see later in the book, NumPy defines a specific data structure that is an N-dimensional array called ndarray. Knowledge of this library is essential in terms of numerical calculations, since its correct use can greatly influence the performance of your computations. Throughout the book, this library is almost omnipresent because of its unique characteristics, so an entire chapter is devoted to it (Chapter 3).

This package provides some features that are added to standard Python:

Ndarray: A multidimensional array that is much faster and more efficient than those provided by the basic package of Python.

Element-wise computation: A set of functions for performing element-wise calculations with arrays and mathematical operations between arrays.

Reading-writing datasets: A set of tools for reading and writing data stored on the hard disk.

Integration with other languages such as C, C++, and FORTRAN: A set of tools for integrating code developed in these programming languages.

Pandas

This package provides complex data structures and functions specifically designed to make working with them easy, fast, and effective. This package is the core of data analysis in Python. Therefore, the study and application of this package is the main goal on which you will work throughout the book (especially in Chapters 4, 5, and 6). Knowledge of its every detail, especially when applied to data analysis, is a fundamental objective of this book.

The fundamental concept of this package is the DataFrame, a two-dimensional tabular data structure with row and column labels.

Pandas applies the high-performance properties of the NumPy library to the manipulation of data in spreadsheets or in relational databases (SQL databases). In fact, by using sophisticated indexing, it is easy to carry out many operations on this kind of data structure, such as reshaping, slicing, aggregation, and the selection of subsets.

matplotlib

This package is the Python library that is currently most popular for producing plots and other data visualizations in 2D. Since data analysis requires visualization tools, this is the library that best suits this purpose. In Chapter 7, you learn about this rich library in detail so you will know how to represent the results of your analysis in the best way.
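As a small preview of how these three libraries work together, the following session builds an ndarray, wraps it in a DataFrame, and plots it. This is a minimal sketch that assumes all three libraries are installed (the Anaconda distribution includes them); everything here is explained properly in the chapters that follow.

>>> import numpy as np
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> a = np.array([1, 2, 3, 4])               # a NumPy ndarray
>>> a * 2                                    # element-wise computation
array([2, 4, 6, 8])
>>> df = pd.DataFrame({'x': a, 'y': a ** 2}) # a pandas DataFrame
>>> df
   x   y
0  1   1
1  2   4
2  3   9
3  4  16
>>> ax = df.plot(x='x', y='y')               # pandas plots via matplotlib
>>> plt.show()                               # display the figure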
Conclusions

During the course of this chapter, all the fundamental aspects characterizing the Python world have been illustrated. The basic concepts of the Python programming language were introduced, with brief examples explaining its innovative aspects and how it stands out compared to other programming languages. In addition, different ways of using Python at various levels were presented. First you saw how to use a simple command-line interpreter, then a set of simple graphical user interfaces were shown, until you got to complex development environments, known as IDEs, such as Spyder, Liclipse, and NinjaIDE.

Even the highly innovative Jupyter (IPython) project was presented, showing you how you can develop Python code interactively, in particular with the Jupyter Notebook.

Moreover, the modular nature of Python was highlighted, with its ability to expand the basic set of standard functions via external libraries. In this regard, the PyPI online repository was shown, along with Python distributions such as Anaconda and Enthought Canopy.

In the next chapter, you deal with the first library that is the basis of all numerical calculations in Python: NumPy. You learn about the ndarray, a data structure that is the basis of the more complex data structures used in data analysis in the following chapters.

CHAPTER 3

The NumPy Library

NumPy is a basic package for scientific computing with Python and especially for data analysis. In fact, this library is the basis of a large amount of mathematical and scientific Python packages, and among them, as you will see later in the book, the pandas library. This library, specialized for data analysis, is fully developed using the concepts introduced by NumPy. In fact, the built-in tools provided by the standard Python library could be too simple or inadequate for most of the calculations in data analysis. Having knowledge of the NumPy library is important to being able to use all scientific Python packages, and particularly, to use and understand the pandas library. The pandas library is the main subject of the following chapters.

If you are already familiar with this library, you can proceed directly to the next chapter; otherwise you may see this chapter as a way to review the basic concepts or to regain familiarity with them by running the examples it contains.

NumPy: A Little History

At the dawn of the Python language, developers needed to perform numerical calculations, especially when this language was being used by the scientific community.

The first attempt was Numeric, developed by Jim Hugunin in 1995, which was followed by an alternative package called Numarray. Both packages were specialized for the calculation of arrays, and each had strengths depending on the case in which it was used. Thus, they were used differently depending on the circumstances. This ambiguity led to the idea of unifying the two packages. Travis Oliphant started to develop the NumPy library for this purpose. Its first release (v 1.0) occurred in 2006.

From that moment on, NumPy established itself as the extension library of Python for scientific computing, and it is currently the most widely used package for the calculation of multidimensional arrays and large arrays. In addition, the package comes with a range of functions that allow you to perform operations on arrays in a highly efficient way and perform high-level mathematical calculations.

Currently, NumPy is open source and licensed under BSD. There are many contributors who have expanded the potential of this library.

The NumPy Installation

Generally, this module is present as a basic package in most Python distributions; however, if not, you can install it later.

On Linux (Ubuntu and Debian), use:

sudo apt-get install python-numpy

On Linux (Fedora), use:

sudo yum install numpy scipy

On Win
