Introduction to Python for Econometrics, Statistics and Data Analysis
5th Edition

Kevin Sheppard
University of Oxford
Monday 27th September, 2021

©2020 Kevin Sheppard

Notes to the 5th Edition

- Python 3.8 or 3.9 are the recommended versions.
- Added a chapter Code Style.
- Expanded coverage of random number generation and added coverage of the preferred method to generate random variates, numpy.random.Generator, in chapter Simulating Random Variables.
- Verified that all code and examples work correctly against 2020 versions of modules. The notable packages and their versions are:
  – Python 3.9 (Preferred version), 3.7 (Minimum version)
  – NumPy: 1.20.3
  – SciPy: 1.7.1
  – pandas: 1.3.2
  – matplotlib: 3.4.2
  – statsmodels: 0.13.0

Notes to the 4th Edition

- Python 3.8 is the recommended version. The notes require Python 3.6 or later, and all references to Python 2.7 have been removed.
- Removed references to NumPy’s matrix class and clarified that it should not be used.
- Verified that all code and examples work correctly against 2020 versions of modules. The notable packages and their versions are:
  – Python 3.8 (Preferred version), 3.6 (Minimum version)
  – NumPy: 1.19.1
  – SciPy: 1.5.2
  – pandas: 1.1.1
  – matplotlib: 3.3.1
- Introduced f-Strings in Section 23.3.3 as the preferred way to format strings using modern Python. The notes use f-Strings where possible instead of format.
- Added coverage of windowing functions (rolling, expanding and ewm) to the pandas chapter.
- Expanded the list of packages of interest to researchers working in statistics, econometrics and machine learning.
- Expanded description of model classes and statistical tests in statsmodels that are most relevant for econometrics. Added section detailing formula support. This list represents only a small fraction of the statsmodels API.
- Added minimize as the preferred interface for non-linear function optimization in Chapter 22.
- Python 2.7 support has been officially dropped, although most examples continue to work with 2.7. Do not use Python 2.7 for numerical code.
- Small typo fixes, thanks to Marton Huebler.
- Fixed direct download of FRED data due to API changes, thanks to Jesper Termansen.
- Thanks to Bill Tubbs for a detailed read and multiple typo reports.
- Updated to changes in line profiler (see Ch. 25).
- Updated deprecations in pandas.
- Removed hold from plotting chapter since this is no longer required.
- Thanks to Gen Li for multiple typo reports.

Notes to the 3rd Edition

This edition includes the following changes from the second edition (August 2014):

- Rewritten installation section focused exclusively on using Continuum’s Anaconda.
- Python 3.5 is the default version of Python instead of 2.7. Python 3.5 (or newer) is well supported by the Python packages required to analyze data and perform statistical analysis, and brings some new useful features, such as a new operator for matrix multiplication (@).
- Removed distinction between integers and longs in built-in data types chapter. This distinction is only relevant for Python 2.7.
- dot has been removed from most examples and replaced with @ to produce more readable code.
- Split Cython and Numba into separate chapters to highlight the improved capabilities of Numba.
- Verified all code working on current versions of core libraries using Python 3.5.
- pandas
  – Updated syntax of pandas functions such as resample.
  – Added pandas Categorical.
  – Expanded coverage of pandas groupby.
  – Expanded coverage of date and time data types and functions.
- New chapter introducing statsmodels, a package that facilitates statistical analysis of data. statsmodels includes regression analysis, Generalized Linear Models (GLM) and time-series analysis using ARIMA models.

Changes since the Second Edition

- Fixed typos reported by a reader, thanks to Ilya Sorvachev.
- Code verified against Anaconda 2.0.1.
- Added diagnostic tools and a simple method to use external code in the Cython section.
- Updated the Numba section to reflect recent changes.
- Fixed some typos in the chapter on Performance and Optimization.
- Added examples of joblib and IPython’s cluster to the chapter on running code in parallel.
- New chapter introducing object-oriented programming as a method to provide structure and organization to related code.
- Added seaborn to the recommended package list, and have included it by default in the graphics chapter.
- Based on experience teaching Python to economics students, the recommended installation has been simplified by removing the suggestion to use virtual environments. The discussion of virtual environments has been moved to the appendix.
- Rewrote parts of the pandas chapter.
- Changed the Anaconda install to use both create and install, which shows how to install additional packages.
- Fixed some missing packages in the direct install.
- Changed the configuration of IPython to reflect best practices.
- Added subsection covering IPython profiles.
- Small section about Spyder as a good starting IDE.

Notes to the 2nd Edition

This edition includes the following changes from the first edition (March 2012):

- The preferred installation method is now Continuum Analytics’ Anaconda. Anaconda is a complete scientific stack and is available for all major platforms.
- New chapter on pandas. pandas provides a simple but powerful tool to manage data and perform preliminary analysis. It also greatly simplifies importing and exporting data.
- New chapter on advanced selection of elements from an array.
- Numba provides just-in-time compilation for numeric Python code which often produces large performance gains when pure NumPy solutions are not available (e.g. looping code).
- Dictionary, set and tuple comprehensions
- Numerous typos
- All code has been verified working against Anaconda 1.7.0.

Contents

1 Introduction
  1.1 Background
  1.2 Conventions
  1.3 Important Components of the Python Scientific Stack
  1.4 Setup
  1.5 Using Python
  1.6 Exercises
  1.A Additional Installation Issues
2 Built-in Data Types
  2.1 Variable Names
  2.2 Core Native Data Types
  2.3 Additional Container Data Types in the Standard Library
  2.4 Python and Memory Management
  2.5 Exercises
3 Arrays
  3.1 Array
  3.2 1-dimensional Arrays
  3.3 2-dimensional Arrays
  3.4 Multidimensional Arrays
  3.5 Concatenation
  3.6 Accessing Elements of an Array
  3.7 Slicing and Memory Management
  3.8 import and Modules
  3.9 Calling Functions
  3.10 Exercises
4 Basic Math
  4.1 Operators
  4.2 Broadcasting
  4.3 Addition (+) and Subtraction (-)
  4.4 Multiplication (*)
  4.5 Matrix Multiplication (@)
  4.6 Array and Matrix Division (/)
  4.7 Exponentiation (**)
  4.8 Parentheses
  4.9 Transpose
  4.10 Operator Precedence
  4.11 Exercises
5 Basic Functions and Numerical Indexing
  5.1 Generating Arrays
  5.2 Rounding
  5.3 Mathematics
  5.4 Complex Values
  5.5 Set Functions
  5.6 Sorting and Extreme Values
  5.7 Nan Functions
  5.8 Functions and Methods/Properties
  5.9 Exercises
6 Special Arrays
  6.1 Exercises
7 Array Functions
  7.1 Shape Information and Transformation
  7.2 Linear Algebra Functions
  7.3 Views
  7.4 Exercises
8 Importing and Exporting Data
  8.1 Importing Data using pandas
  8.2 Importing Data without pandas
  8.3 Saving or Exporting Data using pandas
  8.4 Saving or Exporting Data without pandas
  8.5 Exercises
9 Inf, NaN and Numeric Limits
  9.1 inf and NaN
  9.2 Floating point precision
  9.3 Exercises
10 Logical Operators and Find
  10.1 >, >=,

If a code block contains the console session indicator >>>, this indicates that the command is running in an interactive IPython session.
Output will often appear after the console command, and will not be preceded by a command indicator.

    >>> x = 1.0
    >>> x + 2
    3.0

If the code block does not contain the console session indicator, the code contained in the block is intended to be executed in a standalone Python file.

    import numpy as np

    x = np.array([1,2,3,4])
    y = np.sum(x)
    print(x)
    print(y)

2 Python performance can be made arbitrarily close to C using a variety of methods, including numba (pure python), Cython (C/Python creole language) or directly calling C code. Moreover, recent advances have substantially closed the gap with respect to other Just-in-Time compiled languages such as MATLAB.

1.3 Important Components of the Python Scientific Stack

1.3.1 Python

Python 3.6 (or later) is required, and Python 3.8 (the latest release) is recommended. This provides the core Python interpreter.

1.3.2 NumPy

NumPy provides a set of array data types which are essential for statistics, econometrics and data analysis.

1.3.3 SciPy

SciPy contains a large number of routines needed for analysis of data. The most important include a wide range of random number generators, linear algebra routines, and optimizers. SciPy depends on NumPy.

1.3.4 Jupyter and IPython

IPython provides an interactive Python environment which enhances productivity when developing code or performing interactive data analysis. Jupyter provides a generic set of infrastructure that enables IPython to be run in a variety of settings including an improved console (QtConsole) or in an interactive web-browser based notebook.

1.3.5 matplotlib and seaborn

matplotlib provides a plotting environment for 2D plots, with limited support for 3D plotting. seaborn is a Python package that improves the default appearance of matplotlib plots without any additional code.

1.3.6 pandas

pandas provides high-performance data structures and is essential when working with data.

1.3.7 statsmodels

statsmodels is pandas-aware and provides models used in the statistical analysis of data including linear regression, Generalized Linear Models (GLMs), and time-series models (e.g., ARIMA).

1.3.8 Performance Modules

A number of modules are available to help with performance. These include Cython and Numba. Cython is a Python module which facilitates using a Python-like language to write functions that can be compiled to native (C code) Python extensions. Numba uses a method of just-in-time compilation to translate a subset of Python to native code using Low-Level Virtual Machine (LLVM).

1.4 Setup

The recommended method to install the Python scientific stack is to use Continuum Analytics’ Anaconda. Appendix ?? describes a more complex installation procedure with instructions for directly installing Python and the required modules when it is not possible to install Anaconda.

Continuum Analytics’ Anaconda

Anaconda, a free product of Continuum Analytics (www.continuum.io), is a virtually complete scientific stack for Python. It includes both the core Python interpreter and standard libraries as well as most modules required for data analysis. Anaconda is free to use, and it provides modules that accelerate linear algebra on Intel processors using the Math Kernel Library (MKL). Continuum Analytics also provides other high-performance modules for reading large data files or using the GPU to further accelerate performance for an additional, modest charge. Most importantly, installation is extraordinarily easy on Windows, Linux, and OS X. Anaconda is also simple to update to the latest version using

    conda update conda
    conda update anaconda

Windows

Installation on Windows requires downloading the installer and running it. Anaconda comes in both Python 2.7 and 3.x flavors, and the latest Python 3.x is required.
These instructions use ANACONDA to indicate the Anaconda installation directory (e.g., the default is C:\Anaconda). Once the setup has completed, open a PowerShell command prompt and run

    cd ANACONDA\Scripts
    conda init powershell
    conda update conda
    conda update anaconda
    conda install html5lib seaborn jupyterlab

which will first ensure that Anaconda is up-to-date and then install some optional modules. conda install can be used later to install other packages that may be of interest. Note that if Anaconda is installed into a directory other than the default, the full path should not contain Unicode characters or spaces.

Notes

The recommended settings for installing Anaconda on Windows are:

- Install for all users, which requires admin privileges. If these are not available, then choose the “Just for me” option, but avoid installing on a path that contains non-ASCII characters, which can cause issues.
- Run conda init powershell to ensure that Anaconda commands can be run from the PowerShell prompt.
- Register Anaconda as the system Python unless you have a specific reason not to (unlikely).

Linux and OS X

Installation on Linux requires executing

    bash Anaconda3-x.y.z-Linux-ISA.sh

where x.y.z will depend on the version being installed and ISA will be either x86 or more likely x86_64. Anaconda comes in both Python 2.7 and 3.x flavors, and the latest Python 3.x is required. The OS X installer is available either as a GUI installer (pkg format) or as a bash installer which is installed in an identical manner to the Linux installation. It is strongly recommended that anaconda/bin is prepended to the path. This can be done by entering

    conda init bash

and then restarting your terminal. Note that other shells such as zsh are also supported, and can be initialized by replacing bash with the name of your preferred shell.
After installation completes, execute

    conda update conda
    conda update anaconda
    conda install html5lib seaborn jupyterlab

which will first ensure that Anaconda is up-to-date and then install some optional modules. conda install can be used later to install other packages that may be of interest.

Notes

All instructions for OS X and Linux assume that conda init bash has been run. If this is not the case, it is necessary to run

    cd ANACONDA
    cd bin

and then all commands must be prepended by ./ as in

    ./conda update conda

1.5 Using Python

Python can be programmed using an interactive session using IPython or by directly executing Python scripts (text files that end with the extension .py) using the Python interpreter.

1.5.1 Python and IPython

Most of this introduction focuses on interactive programming, which has some distinct advantages when learning a language. The standard Python interactive console is very basic and does not support useful features such as tab completion. IPython, and especially the QtConsole version of IPython, transforms the console into a highly productive environment which supports a number of useful features:

- Tab completion - After entering 1 or more characters, pressing the tab key will bring up a list of functions, packages, and variables which match the typed text. If the list of matches is large, pressing tab again allows the arrow keys to be used to browse and select a completion.
- “Magic” functions, which make tasks such as navigating the local file system (using %cd ~/directory/ or just cd ~/directory/ assuming that %automagic is on) or running other Python programs (using run program.py) simple. Entering %magic inside an IPython session will produce a detailed description of the available functions. Alternatively, %lsmagic produces a succinct list of available magic commands. The most useful magic functions are
  – cd - change directory
  – edit filename - launch an editor to edit filename
  – ls or ls pattern - list the contents of a directory
  – run filename - run the Python file filename
  – timeit - time the execution of a piece of code or function
  – history - view commands recently run. When used with the -l switch, the history of previous sessions can be viewed (e.g., history -l 100 will show the most recent 100 commands irrespective of whether they were entered in the current IPython session or a previous one).
- Integrated help - When using the QtConsole, calling a function provides a view of the top of the help function. For example, entering mean( will produce a view of the top 20 lines of its help text.
- Inline figures - Both the QtConsole and the notebook can also display figures inline, which produces a tidy, self-contained environment. This can be enabled by entering %matplotlib inline in an IPython session.
- The special variable _ contains the last result in the console, and so the most recent result can be saved to a new variable using the syntax x = _.
- Support for profiles, which provide further customization of sessions.

1.5.2 Launching IPython

OS X and Linux

IPython can be started by running

    ipython

in the terminal. IPython using the QtConsole can be started using

    jupyter qtconsole

A single line launcher on OS X or Linux can be constructed using

    bash -c "jupyter qtconsole"

This single line launcher can be saved as filename.command, where filename is a meaningful name (e.g. IPython-Terminal), to create a launcher on OS X by entering the command

    chmod 755 /FULL/PATH/TO/filename.command

The same command can be used to create a Desktop launcher on Ubuntu by running

    sudo apt-get install --no-install-recommends gnome-panel
    gnome-desktop-item-edit ~/Desktop/ --create-new

and then using the command as the Command in the dialog that appears.

Figure 1.1: IPython running in the Windows Terminal app.
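The %timeit magic described in section 1.5.1 is specific to IPython. As a rough sketch of how a similar measurement might be taken in a standalone script, the standard library's timeit module can be used. The statement being timed and the repetition count below are arbitrary choices made for illustration and are not taken from these notes.

```python
import timeit

# Statement to time; an arbitrary example chosen for illustration
stmt = "sum(range(1000))"

# Number of repetitions; also an arbitrary illustrative choice
repetitions = 1000

# Run the statement repeatedly and return the total elapsed time in seconds
total = timeit.timeit(stmt, number=repetitions)

# Average time per execution, converted to microseconds
per_call_us = total / repetitions * 1e6
print(f"{per_call_us:.2f} microseconds per call")
```

Timings vary from machine to machine and from run to run, so the printed value is only indicative.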

Windows (Anaconda)

To run IPython, open PowerShell and enter ipython, or launch IPython from the start menu. Starting IPython using the QtConsole is similar and is simply called QtConsole in the start menu. Launching IPython from the start menu should create a window similar to that in figure 1.1.

Next, run

    jupyter qtconsole --generate-config

in the terminal or command prompt to generate a file named jupyter_qtconsole_config.py. This file contains settings that are useful for customizing the QtConsole window. A few recommended modifications are

    c.ConsoleWidget.font_size = 12
    c.ConsoleWidget.font_family = "Bitstream Vera Sans Mono"
    c.JupyterWidget.syntax_style = "monokai"

These settings assume that the Bitstream Vera fonts, which are available from http://ftp.gnome.org/pub/GNOME/sources/ttf-bitstream-vera/1.10/, have been locally installed. Opening QtConsole should create a window similar to that in figure 1.2 (although the appearance might differ if you did not use the recommended configuration).

Figure 1.2: IPython running in a QtConsole session.

1.5.3 Getting Help

Help is available in IPython sessions using help(function). Some functions (and modules) have very long help files. When using IPython, these can be paged using the command ?function or function? so that the text can be scrolled using page up and down and q to quit. ??function or function?? can be used to display the entire function including both the docstring and the code.

1.5.4 Running Python programs

While interactive programming is useful for learning a language or quickly developing some simple code, complex projects require the use of complete programs. Programs can be run either using the IPython magic %run program.py or by directly launching the Python program with the standard interpreter using python program.py. The advantage of using the IPython environment is that the variables used in the program can be inspected after the program run has completed.
Directly calling Python will run the program and then terminate, and so it is necessary to output any important results to a file so that they can be viewed later.3

3 Programs can also be run in the standard Python interpreter using the command: exec(compile(open('filename.py').read(),'filename.py','exec'))

To test that you can successfully execute a Python program, input the code in the block below into a text file and save it as firstprogram.py.

    # First Python program
    import time

    print("Welcome to your first Python program.")
    input("Press enter to exit the program.")
    print("Bye!")
    time.sleep(2)

Once you have saved this file, open the console, navigate to the directory where you saved the file and enter python firstprogram.py. Finally, run the program in IPython by first launching IPython, then using %cd to change to the location of the program, and finally executing the program using %run firstprogram.py.

1.5.5 %pylab and %matplotlib

When writing Python code, only a small set of core functions and variable types are available in the interpreter. The standard method to access additional variable types or functions is to use imports, which explicitly allow access to specific packages or functions. While it is best practice to only import required functions or packages, there are many functions in multiple packages that are commonly encountered in these notes. Pylab is a collection of common NumPy, SciPy and Matplotlib functions that can be easily imported using a single command in an IPython session, %pylab. This is nearly equivalent to calling from pylab import *; the difference is that %pylab also sets the backend that is used to draw plots. The backend can be manually set using %pylab backend where backend is one of the available backends (e.g., qt5 or inline).

Figure 1.3: A successful test that matplotlib, IPython, NumPy and SciPy were all correctly installed.
Similarly, %matplotlib backend can be used to set just the backend without importing all of the modules and functions that come with %pylab. Most chapters assume that %pylab has been called so that functions provided by NumPy can be called without explicitly importing them.

1.5.6 Testing the Environment

To make sure that you have successfully installed the required components, run IPython using the shortcut or by running ipython or jupyter qtconsole in a terminal window. Enter the following commands, one at a time (the meaning of the commands will be covered later in these notes).

    >>> %pylab qt5
    >>> x = randn(100,100)
    >>> y = mean(x,0)
    >>> import seaborn
    >>> plot(y)
    >>> import scipy as sp

If everything was successfully installed, you should see something similar to figure 1.3.

1.5.7 jupyterlab notebooks

A jupyter notebook is a simple and useful method to share code with others. Notebooks allow for a fluid synthesis of formatted text, typeset mathematics (using LaTeX via MathJax) and Python. The primary method for using notebooks is through a web interface, which allows creation, deletion, export and interactive editing of notebooks.

Figure 1.4: The default IPython Notebook screen showing two notebooks.

To launch the jupyterlab server, open a command prompt or terminal and enter

    jupyter lab

This command will start the server and open the default browser, which should be a modern version of Chrome (preferable), Chromium, Firefox or Edge. If the default browser is Safari or Internet Explorer, the URL can be copied and pasted into Chrome. The first screen that appears will look similar to figure 1.4, except that the list of notebooks will be empty. Clicking on New Notebook will create a new notebook, which, after a bit of typing, can be transformed to resemble figure 1.5. Notebooks can be imported by dragging and dropping and exported from the menu inside a notebook.
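As a complement to the interactive test in section 1.5.6, the sketch below shows one way the installation might be checked from a standalone script without importing the packages themselves. It uses only the standard library. The helper name missing_packages and the list of import names are assumptions made for this illustration; the minimum version (3.6) is the one stated in section 1.3.1.

```python
import importlib.util
import sys

# Minimum Python version stated in section 1.3.1 of these notes
MIN_VERSION = (3, 6)

# Top-level import names of the packages described in section 1.3
# (this particular list is an illustrative assumption)
PACKAGES = ["numpy", "scipy", "pandas", "matplotlib", "seaborn",
            "statsmodels", "IPython"]


def missing_packages(names):
    """Return the names that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Compare the (major, minor) version of the running interpreter to the minimum
version_ok = sys.version_info[:2] >= MIN_VERSION
print(f"Python {sys.version_info.major}.{sys.version_info.minor} "
      f"meets the minimum: {version_ok}")

# Report any packages that could not be found
print("Missing packages:", missing_packages(PACKAGES) or "none")
```

This only confirms that the packages can be located; it does not test plotting, so the interactive test remains the more complete check.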

1.5.8 Integrated Development Environments

As you progress in Python and begin writing more sophisticated programs, you will find that using an Integrated Development Environment (IDE) will increase your productivity. Most contain productivity enhancements such as built-in consoles, code completion (or IntelliSense, for completing function names) and integrated debugging. Discussion of IDEs is beyond the scope of these notes, although Spyder is a reasonable choice (free, cross-platform). Visual Studio Code is an excellent alternative. My preferred IDE is PyCharm, which has a community edition that is free for use (the professional edition is low cost for academics).

Spyder

Spyder is an IDE specialized for use in scientific applications of Python rather than for general purpose application development. This is both an advantage and a disadvantage when compared to a full featured IDE such as PyCharm or VS Code. The main advantage is that many powerful but complex features are not integrated into Spyder, and so the learning curve is much shallower. The disadvantage is similar: in more complex projects, or if developing something that is not straight scientific Python, Spyder is less capable. However, netting these two, Spyder is almost certainly the IDE to use when starting Python, and it is always relatively simple to migrate to a sophisticated IDE if needed.

Figure 1.5: A jupyterlab notebook showing formatted markdown, LaTeX math and cells containing code.

Spyder is started by entering spyder in the terminal or command prompt. A window similar to that in figure 1.6 should appear. The main components are the editor (1), the object inspector (2), which dynamically shows help for functions that are used in the editor, and the console (3). By default, Spyder opens a standard Python console, although it also supports using the more powerful IPython console.
The object inspector window, by default, is grouped with a variable explorer, which shows the variables that are in memory, and the file explorer, which can be used to navigate the file system. The console is grouped with an IPython console window (which needs to be activated first using the Interpreters menu along the top edge), and the history log, which contains a list of commands executed. The buttons along the top edge facilitate saving code, running code and debugging.

1.6 Exercises

1. Install Python.
2. Test the installation using the code in section 1.5.6.
3. Customize IPython QtConsole using a font or color scheme. More customization options can be found by running ipython -h.
4. Explore tab completion in IPython by entering a and pressing tab to see the list of functions which start with a and are loaded by pylab. Next try i, which will produce a list longer than the screen; press ESC to exit the pager.
5. Launch IPython Notebook and run code in the testing section.
6. Open Spyder and explore its features.

Figure 1.6: The default Spyder IDE on Windows.

1.A Additional Installation Issues

1.A.1 Frequently Encountered Problems

All

Whitespace sensitivity

Python is whitespace sensitive and so indentation, either spaces or tabs, affects how Python interprets files. The configuration files, e.g. ipython_config.py, are plain Python files and so are sensitive to whitespace. Introducing whitespace before the start of a configuration option will produce an error, so ensure there is no whitespace before active lines of a configuration.

Windows

Spaces in path

Python may work when directories have spaces.

Unicode in path

Python does not always work well when a path contains Unicode characters, which might occur in a user name. While this isn’t an issue for installing Python or Anaconda, it is an issue for IPython, which looks in c:\user\username\.ipython for configuration files.
The solution is to define the HOME variable, before launching IPython, as a path that has only ASCII characters.

    mkdir c:\anaconda\ipython_config
    set HOME=c:\anaconda\ipython_config
    c:\Anaconda\Scripts\activate econometrics
    ipython profile create econometrics
    ipython --profile=econometrics

The HOME variable can point to any path with directories containing only ASCII characters, and the line set HOME=c:\anaconda\ipython_config can also be added to any batch file to achieve the same effect.

OS X

Installing Anaconda to the root of the partition

If the user account used is running as root, then Anaconda may install to /anaconda and not ~/anaconda by default. Best practice is not to run as root, although in principle this is not a problem, and /anaconda can be used in place of ~/anaconda in any of the instructions.

1.A.2 Setup using Virtual Environments

The simplest method to install the Python scientific stack is to directly use Continuum Analytics’ Anaconda. These instructions describe alternative installation options using virtual environments, which allow alternative configurations to simultaneously co-exist on a single system. The primary advantage of a virtual environment is that it allows package versions to be frozen, so that upgrading a module or all of Anaconda does not upgrade the packages in a particular virtual environment.

Windows

Installation on Windows requires downloading the installer and running it. These instructions use ANACONDA to indicate the Anaconda installation directory (e.g. the default is C:\Anaconda).
Once the setup has completed, open a PowerShell prompt and run

    cd ANACONDA\Scripts
    conda init powershell
    conda update conda
    conda update anaconda
    conda create -n econometrics qtconsole notebook matplotlib numpy pandas scipy spyder statsmodels
    conda install -n econometrics cython lxml nose numba numexpr pytables sphinx xlrd xlwt html5lib seaborn

which will first ensure that Anaconda is up-to-date and then create a virtual environment named econometrics. Using a virtual environment is a best practice and is important since component updates can lead to errors in otherwise working programs due to backward incompatible changes in a module. The long list of modules in the conda create command includes the core modules. conda install contains the remaining packages and is shown as an example of how to add packages to an existing virtual environment after it has been created. It is also possible to install all available Anaconda packages using the command conda create -n econometrics anaconda.

The econometrics environment must be activated before use. This is accomplished by running

    conda activate econometrics

from the command prompt, which prepends [econometrics] to the prompt as an indication that the virtual environment is active. Activate the econometrics environment and then run

    cd c:\
    ipython

which will open an IPython session using the newly created virtual environment.

Virtual environments can also be created using specific versions of packages using pinning. For example, to create a virtual environment named old using Python 3.6 and NumPy 1.16,

    conda create -n old python=3.6 numpy=1.16 scipy pandas

which will install the requested versions of Python and NumPy as well as the latest versions of SciPy and pandas that are compatible with the pinned versions.
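After creating a pinned environment, it can be useful to confirm which versions were actually installed. The sketch below is one possible way to do this from Python inside the activated environment, using importlib.metadata from the standard library. It assumes Python 3.8 or later, so it would not run in the Python 3.6 environment named old above; the helper name installed_version and the list of packages are illustrative choices, not part of these notes.

```python
from importlib import metadata  # standard library in Python 3.8 and later


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names to report on (an illustrative selection)
for name in ["numpy", "scipy", "pandas"]:
    print(name, installed_version(name))
```

Running conda list in the activated environment reports the same information from the command line.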
Linux and OS X

Installation on Linux requires executing

bash Anaconda3-x.y.z-Linux-ISA.sh

where x.y.z will depend on the version being installed and ISA will be either x86 or, more likely, x86_64. The OS X installer is available either as a GUI installer (pkg format) or as a bash installer, which is installed in an identical manner to the Linux installation. After installation completes, change to the folder where Anaconda was installed (written here as ANACONDA, default ~/anaconda) and execute

cd ANACONDA
cd bin
./conda init bash
./conda update conda
./conda update anaconda
./conda create -n econometrics qtconsole notebook matplotlib numpy pandas scipy spyder statsmodels
./conda install -n econometrics cython lxml nose numba numexpr pytables sphinx xlrd xlwt html5lib seaborn

which will first ensure that Anaconda is up-to-date and then create a virtual environment named econometrics with the required packages. conda create creates the environment and conda install installs additional packages to the existing environment. conda install can be used later to install other packages that may be of interest. To activate the newly created environment, run

conda activate econometrics

and then run the command ipython to launch IPython using the newly created virtual environment.

Chapter 2

Built-in Data Types

Before diving into Python for analyzing data or running Monte Carlos, it is necessary to understand some basic concepts about the core Python data types. Unlike domain-specific languages such as MATLAB or R, where the default data type has been chosen for numerical work, Python is a general purpose programming language which is also well suited to data analysis, econometrics, and statistics.
For example, the basic numeric type in MATLAB is an array (using double precision, which is useful for floating point mathematics), while the basic numeric data type in Python is a scalar which may be either an integer or a double-precision floating point number, depending on the formatting of the number when input.

2.1 Variable Names

Variable names can take many forms, although they can only contain numbers, letters (both upper and lower case), and underscores (_). They must begin with a letter or an underscore and are CaSe SeNsItIve. Additionally, some words are reserved in Python and so cannot be used for variable names (e.g. import or for). For example,

x = 1.0
X = 1.0
X1 = 1.0
x1 = 1.0
dell = 1.0
dellreturns = 1.0
dellReturns = 1.0
_x = 1.0
x_ = 1.0

are all legal and distinct variable names. Note that names which begin or end with an underscore, while legal, are not normally used since by convention these convey special meaning.[1] Illegal names do not follow these rules.

# Not allowed
x: = 1.0
1X = 1
X-1 = 1
for = 1

[1] Variable names with a single leading underscore, for example _some_internal_value, indicate that the variable is for internal use by a module or class. While indicated to be private, this variable will generally be accessible by calling code. Double leading underscores, for example __some_private_value, trigger name mangling when used inside a class, which makes the value harder, although not impossible, to access from outside the class. Variable names with trailing underscores, such as class_ or lambda_, are used to avoid conflicts with reserved Python words. Double leading and trailing underscores are reserved for “magic” variables (e.g. __init__), and so should be avoided except when specifically accessing a feature.

Multiple variables can be assigned on the same line using commas,

x, y, z = 1, 3.1415, 'a'

2.2 Core Native Data Types

2.2.1 Numeric

Simple numbers in Python can be either integers, floats or complex.
This chapter does not cover all Python data types and instead focuses on those which are most relevant for numerical analysis, econometrics, and statistics. The bytes, bytearray and memoryview data types are not described.

2.2.1.1 Floating Point (float)

The most important (scalar) data type for numerical analysis is the float. Unfortunately, not all non-complex numeric data types are floats. To input a floating data type, it is necessary to include a . (period, dot) in the expression. This example uses the function type() to determine the data type of a variable.

>>> x = 1
>>> type(x)
int
>>> x = 1.0
>>> type(x)
float
>>> x = float(1)
>>> type(x)
float

This example shows that the expression x = 1 produces an integer-valued variable while x = 1.0 produces a float-valued variable. Using integers can produce unexpected results and so it is important to include “.0” when expecting a float.

2.2.1.2 Complex (complex)

Complex numbers are also important for numerical analysis. Complex numbers are created in Python using j or the function complex().

>>> x = 1.0
>>> type(x)
float
>>> x = 1j
>>> type(x)
complex
>>> x = 2 + 3j
>>> x
(2+3j)
>>> x = complex(1)
>>> x
(1+0j)

Note that a+bj is the same as complex(a,b), while complex(a) is the same as a+0j.

2.2.1.3 Integers (int)

Floats use an approximation to represent numbers which may contain a decimal portion. The integer data type stores numbers using an exact representation, so that no approximation is needed. The cost of the exact representation is that the integer data type cannot express anything that isn’t an integer, rendering integers of limited use in most numerical work. Basic integers can be entered either by excluding the decimal (see float), or explicitly using the int() function. The int() function can also be used to convert a float to an integer by rounding towards 0.
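Rounding towards 0 is easiest to see with negative inputs, where it differs from rounding down. The following is a minimal sketch that compares int() with math.floor() and the built-in round(); the comparison with the latter two functions is an addition for illustration and is not taken from the text.

```python
import math

values = [2.7, -2.7, 0.5, -0.5]

# int() discards the fractional part, i.e. it rounds towards 0
truncated = [int(v) for v in values]

# math.floor() always rounds down, so it differs from int() for negative inputs
floored = [math.floor(v) for v in values]

# round() rounds to the nearest integer, with ties going to the even neighbour
rounded = [round(v) for v in values]

print(truncated)  # [2, -2, 0, 0]
print(floored)    # [2, -3, 0, -1]
print(rounded)    # [3, -3, 0, 0]
```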
>>> x = 1
>>> type(x)
int
>>> x = 1.0
>>> type(x)
float
>>> x = int(x)
>>> type(x)
int

Python integers have unlimited range since the number of bits used to store an integer is dynamic.

>>> x = 1
>>> x
1
>>> type(x)
int
>>> x = 2 ** 127 + 2 ** 65 # ** denotes exponentiation, y ** 64 is y^{64} in TeX
>>> x
170141183460469231768580791863303208960

2.2.2 Boolean (bool)

The Boolean data type is used to represent true and false, using the reserved keywords True and False. Boolean variables are important for program flow control (see Chapter 12) and are typically created as a result of logical operations (see Chapter 10), although they can be entered directly.

>>> x = True
>>> type(x)
bool
>>> x = bool(1)
>>> x
True
>>> x = bool(0)
>>> x
False

Non-zero, non-empty values generally evaluate to true when evaluated by bool(). Zero or empty values such as bool(0), bool(0.0), bool(0.0j), bool(None), bool('') and bool([]) are all false.

2.2.3 Strings (str)

Strings are not usually important for numerical analysis, although they are frequently encountered when dealing with data files, especially when importing or when formatting output for human consumption. Strings are delimited using single quotes ('') or double quotes ("") but not using a combination of the two delimiters (i.e., do not use '") in a single string, except when used to express a quotation.

>>> x = 'abc'
>>> type(x)
str
>>> y = '"A quotation!"'
>>> print(y)
"A quotation!"

String manipulation is further discussed in Chapter 23.

2.2.3.1 Slicing Strings

Substrings within a string can be accessed using slicing. Slicing uses [] to contain the indices of the characters in a string, where the first index is 0, and the last is n − 1 (assuming the string has n letters). The following table describes the types of slices which are available.
The most useful are s[i], which returns the character in position i, s[:i], which returns the leading characters from positions 0 to i − 1, and s[i:], which returns the trailing characters from positions i to n − 1. The table below provides a list of the types of slices which can be used. The second half of the table shows that slicing can use negative indices, which essentially index the string backward.

Slice         Behavior
s[:]          Entire string
s[i]          Character $i$
s[i:]         Characters $i, \ldots, n-1$
s[:i]         Characters $0, \ldots, i-1$
s[i:j]        Characters $i, \ldots, j-1$
s[i:j:m]      Characters $i, i+m, \ldots, i+m\lfloor (j-i-1)/m \rfloor$
s[-i]         Character $n-i$
s[-i:]        Characters $n-i, \ldots, n-1$
s[:-i]        Characters $0, \ldots, n-i-1$
s[-j:-i]      Characters $n-j, \ldots, n-i-1$, where $-j < -i$
s[-j:-i:m]    Characters $n-j, n-j+m, \ldots, n-j+m\lfloor (j-i-1)/m \rfloor$

>>> text = 'Python strings are sliceable.'
>>> text[0]
'P'
>>> text[10]
'i'
>>> L = len(text)
>>> text[L] # Error
IndexError: string index out of range
>>> text[L-1]
'.'
>>> text[:10]
'Python str'
>>> text[10:]
'ings are sliceable.'

2.2.4 Lists (list)

Lists are a built-in container data type which hold other data. A list is a collection of other objects – floats, integers, complex numbers, strings or even other lists. Lists are essential to Python programming and are used to store collections of other values. For example, a list of floats can be used to express a vector (although the NumPy data type array is better suited to working with collections of numeric values). Lists also support slicing to retrieve one or more elements. Basic lists are constructed using square brackets, [], and values are separated using commas.
>>> x = []
>>> type(x)
list
>>> x=[1,2,3,4]
>>> x
[1, 2, 3, 4]
# 2-dimensional list (list of lists)
>>> x = [[1,2,3,4], [5,6,7,8]]
>>> x
[[1, 2, 3, 4], [5, 6, 7, 8]]
# Jagged list, not rectangular
>>> x = [[1,2,3,4] , [5,6,7]]
>>> x
[[1, 2, 3, 4], [5, 6, 7]]
# Mixed data types
>>> x = [1,1.0,1+0j,'one',None,True]
>>> x
[1, 1.0, (1+0j), 'one', None, True]

These examples show that lists can be regular, nested and can contain any mix of data types including other lists.

2.2.4.1 Slicing Lists

Lists, like strings, can be sliced. Slicing is similar, although lists can be sliced in more ways than strings. The difference arises since lists can be multi-dimensional while strings are always 1×n. Basic list slicing is identical to slicing strings, and operations such as x[:], x[1:], x[:1] and x[-3:] can all be used. To understand slicing, assume x is a 1-dimensional list with n elements and $i \geq 0$, $j > 0$, $i < j$, $m \geq 1$. Python uses 0-based indices, and so the n elements of x can be thought of as $x_0, x_1, \ldots, x_{n-1}$.

Slice         Behavior
x[:]          Return all of x
x[i]          Return $x_i$
x[i:]         Return $x_i, \ldots, x_{n-1}$
x[:i]         Return $x_0, \ldots, x_{i-1}$
x[i:j]        Return $x_i, x_{i+1}, \ldots, x_{j-1}$
x[i:j:m]      Return $x_i, x_{i+m}, \ldots, x_{i+m\lfloor (j-i-1)/m \rfloor}$
x[-i]         Return $x_{n-i}$, except when $i = 0$ (x[-0] is x[0])
x[-i:]        Return $x_{n-i}, \ldots, x_{n-1}$
x[:-i]        Return $x_0, \ldots, x_{n-i-1}$
x[-j:-i]      Return $x_{n-j}, \ldots, x_{n-i-1}$
x[-j:-i:m]    Return $x_{n-j}, x_{n-j+m}, \ldots, x_{n-j+m\lfloor (j-i-1)/m \rfloor}$

The default list slice uses a unit stride (step size of one). It is possible to use other strides using a third input in the slice so that the slice takes the form x[i:j:m] where i is the index to start, j is the index to end (exclusive) and m is the stride length. For example x[::2] will select every second element of a list and is equivalent to x[0:n:2] where n = len(x). The stride can also be negative, which can be used to select the elements of a list in reverse order. For example, x[::-1] will reverse a list and is equivalent to x[n-1::-1]. Note that x[0:n:-1] is empty, since a negative stride counts down from the start index.
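The stride rules can be checked directly. The following is a minimal sketch using a 10-element list; the variable names are arbitrary, and the final lines check what explicit start and stop values do when the stride is negative.

```python
x = list(range(10))
n = len(x)

# Every second element; the default start and stop are 0 and n
every_second = x[::2]
assert every_second == x[0:n:2]

# A negative stride walks the list backward
reversed_x = x[::-1]

# With explicit indices the slice must start at the last element
assert reversed_x == x[n - 1::-1] == x[-1:-(n + 1):-1]

# Starting at index 0 with a negative stride selects nothing
empty = x[0:n:-1]

print(every_second)  # [0, 2, 4, 6, 8]
print(reversed_x)    # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
print(empty)         # []
```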
Examples of accessing elements of 1-dimensional lists are presented below.

>>> x = [0,1,2,3,4,5,6,7,8,9]
>>> x[0]
0
>>> x[5]
5
>>> x[10] # Error
IndexError: list index out of range
>>> x[4:]
[4, 5, 6, 7, 8, 9]
>>> x[:4]
[0, 1, 2, 3]
>>> x[1:4]
[1, 2, 3]
>>> x[-0]
0
>>> x[-1]
9
>>> x[-10:-1]
[0, 1, 2, 3, 4, 5, 6, 7, 8]

Lists can be multidimensional, and slicing can be done directly in higher dimensions. For simplicity, consider slicing a 2-dimensional list x = [[1,2,3,4], [5,6,7,8]]. If single indexing is used, x[0] will return the first (inner) list, and x[1] will return the second (inner) list. Since the list returned by x[0] is sliceable, the inner list can be directly sliced using x[0][0] or x[0][1:4].

>>> x = [[1,2,3,4], [5,6,7,8]]
>>> x[0]
[1, 2, 3, 4]
>>> x[1]
[5, 6, 7, 8]
>>> x[0][0]
1
>>> x[0][1:4]
[2, 3, 4]
>>> x[1][-4:-1]
[5, 6, 7]

2.2.4.2 List Functions

A number of functions are available for manipulating lists. The most useful are

Function                 Method             Description
list.append(x,value)     x.append(value)    Appends value to the end of the list.
len(x)                   –                  Returns the number of elements in the list.
list.extend(x,list)      x.extend(list)     Appends the values in list to the existing list.
list.pop(x,index)        x.pop(index)       Removes the value in position index and returns the value.
list.remove(x,value)     x.remove(value)    Removes the first occurrence of value from the list.
list.count(x,value)      x.count(value)     Counts the number of occurrences of value in the list.
del x[slice]             –                  Deletes the elements in slice.

>>> x = [0,1,2,3,4,5,6,7,8,9]
>>> x.append(0)
>>> x
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0]
>>> len(x)
11
>>> x.extend([11,12,13])
>>> x
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 11, 12, 13]
>>> x.pop(1)
1
>>> x
[0, 2, 3, 4, 5, 6, 7, 8, 9, 0, 11, 12, 13]
>>> x.remove(0)
>>> x
[2, 3, 4, 5, 6, 7, 8, 9, 0, 11, 12, 13]

Elements can also be deleted from lists using the keyword del in combination with a slice.
>>> x = [0,1,2,3,4,5,6,7,8,9]
>>> del x[0]
>>> x
[1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> x[:3]
[1, 2, 3]
>>> del x[:3]
>>> x
[4, 5, 6, 7, 8, 9]
>>> del x[1:3]
>>> x
[4, 7, 8, 9]
>>> del x[:]
>>> x
[]

2.2.5 Tuples (tuple)

A tuple is virtually identical to a list with one important difference – tuples are immutable. Immutability means that a tuple cannot be changed once created. It is not possible to add, remove, or replace elements in a tuple. However, if a tuple contains a mutable data type, for example a tuple that contains a list, the contents of the mutable data type can be altered.

Tuples are constructed using parentheses (()) in place of the square brackets ([]) used to create lists. Tuples can be sliced in an identical manner to lists. A list can be converted into a tuple using tuple() (similarly, a tuple can be converted to a list using list()).

>>> x =(0,1,2,3,4,5,6,7,8,9)
>>> type(x)
tuple
>>> x[0]
0
>>> x[-10:-5]
(0, 1, 2, 3, 4)
>>> x = list(x)
>>> type(x)
list
>>> x = tuple(x)
>>> type(x)
tuple
>>> x= ([1,2],[3,4])
>>> x[0][1] = -10
>>> x # Contents can change, elements cannot
([1, -10], [3, 4])

Note that tuples containing a single element must contain a comma when created, so that x = (2,) assigns a tuple to x, while x=(2) will assign 2 to x. The latter interprets the parentheses as if they are part of a mathematical formula rather than being used to construct a tuple. x = tuple([2]) can also be used to create a single element tuple. Lists do not have this issue since square brackets do not have this ambiguity.

>>> x =(2)
>>> type(x)
int
>>> x = (2,)
>>> type(x)
tuple
>>> x = tuple([2])
>>> type(x)
tuple

2.2.5.1 Tuple Functions

Tuples are immutable, and so only have the methods index and count, which behave in an identical manner to their list counterparts.

2.2.6 Dictionary (dict)

Dictionaries are encountered far less frequently than any of the previously described data types in numerical Python.
They are, however, commonly used to pass options into other functions such as optimizers, and so familiarity with dictionaries is important. Dictionaries in Python are composed of keys (words) and values (definitions). Dictionary keys must be unique immutable data types (e.g. strings, the most common key, integers, or tuples containing immutable types), and values can contain any valid Python data type.[3] Values are accessed using keys.

>>> data = {'age': 34, 'children' : [1,2], 1: 'apple'}
>>> type(data)
dict
>>> data['age']
34

Values associated with an existing key can be updated by making an assignment to the key in the dictionary.

>>> data['age'] = 'xyz'
>>> data['age']
'xyz'

New key-value pairs can be added by defining a new key and assigning a value to it.

>>> data['name'] = 'abc'
>>> data
{'age': 'xyz', 'children': [1, 2], 1: 'apple', 'name': 'abc'}

Key-value pairs can be deleted using the reserved keyword del.

>>> del data['age']
>>> data
{'children': [1, 2], 1: 'apple', 'name': 'abc'}

[3] Formally, dictionary keys must support the __hash__ function and equality comparison, and keys that compare equal must have the same hash.

2.2.7 Sets (set, frozenset)

Sets are collections which contain all unique elements of a collection. set and frozenset only differ in that the latter is immutable (and so is hashable, which allows it to be used as a dictionary key or as an element of another set), and so set is similar to a unique list while frozenset is similar to a unique tuple. While sets are generally not important in numerical analysis, they can be very useful when working with messy data – for example, finding the set of unique tickers in a long list of tickers.

2.2.7.1 Set Functions

A number of methods are available for manipulating sets. The most useful are

Function                   Method                 Description
set.add(x,element)         x.add(element)         Adds element to the set.
len(x)                     –                      Returns the number of elements in the set.
set.difference(x,set)      x.difference(set)      Returns the elements in x which are not in set.
set.intersection(x,set)    x.intersection(set)    Returns the elements of x which are also in set.
set.remove(x,element)      x.remove(element)      Removes element from the set.
set.union(x,set)           x.union(set)           Returns the set containing all elements of x and set.

The code below demonstrates the use of set. Note that 'MSFT' is repeated in the list used to initialize the set, but only appears once in the set since all elements must be unique.

>>> x = set(['MSFT','GOOG','AAPL','HPQ','MSFT'])
>>> x
{'AAPL', 'GOOG', 'HPQ', 'MSFT'}
>>> x.add('CSCO')
>>> x
{'AAPL', 'CSCO', 'GOOG', 'HPQ', 'MSFT'}
>>> y = set(['XOM', 'GOOG'])
>>> x.intersection(y)
{'GOOG'}
>>> x = x.union(y)
>>> x
{'AAPL', 'CSCO', 'GOOG', 'HPQ', 'MSFT', 'XOM'}
>>> x.remove('XOM')
>>> x
{'AAPL', 'CSCO', 'GOOG', 'HPQ', 'MSFT'}

A frozenset supports the same methods except add and remove.

2.2.8 range

A range is most commonly encountered in a for loop. range(a,b,i) creates the sequence that follows the pattern $a, a+i, a+2i, \ldots, a+(m-1)i$ where $m = \lceil (b-a)/i \rceil$. In other words, it finds all integers x starting with a such that $a \leq x < b$ and where two consecutive values are separated by i. range can be called with one or two parameters – range(a,b) is the same as range(a,b,1) and range(b) is the same as range(0,b,1).

>>> x = range(10)
>>> type(x)
range
>>> print(x)
range(0, 10)
>>> list(x)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> x = range(3,10)
>>> list(x)
[3, 4, 5, 6, 7, 8, 9]
>>> x = range(3,10,3)
>>> list(x)
[3, 6, 9]

range is not technically a list, which is why the statement print(x) returns range(0, 10). Explicitly converting with list produces a list which allows the values to be printed. range is technically a lazy, immutable sequence which computes its values on demand and so does not require the storage space of a list.

2.3 Additional Container Data Types in the Standard Library

Python includes an extensive standard library that provides many features that extend the core Python language.
Data types in the standard library are always installed alongside the Python interpreter. However, they are not “built-in” since using one requires an explicit import statement to make a particular data type available. The standard library is vast and some examples of the included functionality are support for working with dates (provided by the datetime module, see Chapter 14), functional programming tools (itertools, functools and operator), tools for accessing the file system (os.path and glob, inter alia, see Chapter 24), and support for interacting with resources on the internet (urllib and ftplib, inter alia). One of the more useful modules included in the standard library is the collections module. This module provides a set of specialized container data types that extend the built-in container data types. Two are particularly useful when working with data: OrderedDict and defaultdict. Both of these extend the built-in dictionary dict with useful features.

2.3.1 OrderedDict

Before Python 3.7, the order of items in a standard Python dict was not guaranteed. OrderedDict addresses this shortcoming by retaining the keys in the order in which they were inserted. The order is also preserved when deleting keys from an OrderedDict. Since Python 3.7 the built-in dict also preserves insertion order, and so in the example below the plain dictionary lists its keys in the same order as the OrderedDict.

>>> from collections import OrderedDict
>>> od = OrderedDict()
>>> od['key1'] = 1
>>> od['key2'] = 'a'
>>> od['key3'] = 'alpha'
>>> plain = dict(od)
>>> print(list(od.keys()))
['key1', 'key2', 'key3']
>>> print(list(plain.keys())) # Same order in Python 3.7 and later
['key1', 'key2', 'key3']
>>> del od['key1']
>>> print(list(od.keys()))
['key2', 'key3']
>>> od['key1'] = 'some other value'
>>> print(list(od.keys()))
['key2', 'key3', 'key1']

This functionality is particularly useful when iterating over the keys in a dictionary since it guarantees a predictable order when accessing the keys (see Chapter 12).
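The ordering behaviour described above can be sketched as follows. The example also uses move_to_end and order-sensitive equality, two OrderedDict features that are not discussed in the text, so those parts should be read as an illustrative aside.

```python
from collections import OrderedDict

od = OrderedDict()
od["key1"] = 1
od["key2"] = "a"
od["key3"] = "alpha"

# Iteration follows insertion order
keys_in_order = [key for key in od]

# Deleting and re-adding a key places it at the end
del od["key1"]
od["key1"] = "some other value"
keys_after_readd = list(od)

# move_to_end reorders without deleting; last=False moves the key to the front
od.move_to_end("key1", last=False)
keys_after_move = list(od)

# Equality between two OrderedDicts takes order into account,
# while equality between plain dicts does not
a = OrderedDict([("x", 1), ("y", 2)])
b = OrderedDict([("y", 2), ("x", 1)])
ordered_differ = a != b
plain_equal = dict(a) == dict(b)

print(keys_in_order)     # ['key1', 'key2', 'key3']
print(keys_after_readd)  # ['key2', 'key3', 'key1']
print(keys_after_move)   # ['key1', 'key2', 'key3']
print(ordered_differ, plain_equal)  # True True
```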
Recent versions of pandas also respect the order in an OrderedDict when adding columns to a DataFrame (see Chapter 16).

2.3.2 defaultdict

By default, attempting to access a key in a dictionary that does not exist will produce an error. There are circumstances where this is undesirable, and when a key is encountered that doesn’t exist, a default value should be added to the dictionary and returned. One particularly useful example of this behavior is when making keyed lists – that is, grouping like elements according to a key in a list. If the key exists, the elements should be appended to the existing list. If the key doesn’t exist, the key should be added and a new list containing the new element should be inserted into the dictionary. defaultdict enables this exact scenario by accepting a callable function as an argument. When a key is found, it behaves just like a standard dictionary. When a key isn’t found, the output of the callable function is assigned to the key. This example uses list to add a new list whenever a key is not found.

>>> d = {}
>>> d['one'].append('an item') # Error
KeyError: 'one'
>>> from collections import defaultdict
>>> dd = defaultdict(list)
>>> dd['one'].append('first')
>>> dd['one'].append('second')
>>> dd['two'].append('third')
>>> print(dd)
defaultdict(<class 'list'>, {'one': ['first', 'second'], 'two': ['third']})

The callable argument provided to defaultdict can be anything that is useful including other containers, objects that will be initialized the first time called, or an anonymous function (i.e. a function defined using lambda, see Section 18.4).

2.4 Python and Memory Management

Python uses a highly optimized memory allocation system which attempts to avoid allocating unnecessary memory. As a result, when one variable is assigned to another (e.g. y = x), these will actually point to the same data in the computer’s memory.
To verify this, id() can be used to determine the unique identification number of a piece of data.[4]

>>> x = 1
>>> y = x
>>> id(x)
82970264
>>> id(y)
82970264
>>> x = 2.0
>>> id(x)
93850568
>>> id(y)
82970264

[4] The ID numbers on your system will likely differ from those in the code listing.

In the above example, the initial assignment of y = x produced two variables with the same ID. However, once x was changed, its ID changed while the ID of y did not, indicating that the data in each variable was stored in different locations. This behavior is both safe and efficient and is common to the basic Python immutable types: int, float, complex, str, tuple, frozenset and range.

2.4.1 Example: Lists

Lists are mutable and so assignment does not create a copy, and so changes to either variable affect both.

>>> x = [1, 2, 3]
>>> y = x
>>> y[0] = -10
>>> y
[-10, 2, 3]
>>> x
[-10, 2, 3]

Slicing a list creates a copy of the list and any immutable types in the list – but not mutable elements in the list.

>>> x = [1, 2, 3]
>>> y = x[:]
>>> id(x)
86245960
>>> id(y)
86240776

To see that the inner lists are not copied, consider the behavior of changing one element in a nested list.

>>> x=[[0,1],[2,3]]
>>> y = x[:]
>>> y
[[0, 1], [2, 3]]
>>> id(x[0])
117011656
>>> id(y[0])
117011656
>>> x[0][0]
0
>>> id(x[0][0])
30390080
>>> id(y[0][0])
30390080
>>> y[0][0] = -10.0
>>> y
[[-10.0, 1], [2, 3]]
>>> x
[[-10.0, 1], [2, 3]]

When lists are nested or contain other mutable objects (which do not copy), slicing copies the outermost list to a new ID, but the inner lists (or other objects) are still linked. In order to copy nested lists, it is necessary to explicitly call deepcopy(), which is in the module copy.

>>> import copy as cp
>>> x=[[0,1],[2,3]]
>>> y = cp.deepcopy(x)
>>> y[0][0] = -10.0
>>> y
[[-10.0, 1], [2, 3]]
>>> x
[[0, 1], [2, 3]]

2.5 Exercises

1. Enter the following into Python, assigning each to a unique variable name:
(a) 4
(b) 3.1415
(c) 1.0
(d) 2+4j
(e) 'Hello'
(f) 'World'

2. What is the type of each variable? Use type if you aren’t sure.

3. Which of the 6 types can be:
(a) Added +
(b) Subtracted -
(c) Multiplied *
(d) Divided /

4. What are the types of the output (when an error is not produced) in the above operations?

5. Input the variable ex = 'Python is an interesting and useful language for numerical computing!' Using slicing, extract the text strings below. Note: There are multiple answers for all of the problems.
(a) Python
(b) !
(c) computing
(d) in
(e) !gnitupmoc laciremun rof egaugnal lufesu dna gnitseretni na si nohtyP (Reversed)
(f) nohtyP
(g) Pto sa neetn n sfllnug o ueia optn!

6. What are the 2 direct methods to construct a tuple that has only a single item? How many ways are there to construct a list with a single item?

7. Construct a nested list to hold the array
$$\begin{bmatrix} 1 & .5 \\ .5 & 1 \end{bmatrix}$$
so that item [i][j] corresponds to the position in the array (Remember that Python uses 0 indexing).

8. Assign the array you just created first to x, and then assign y=x. Change y[0][0] to 1.61. What happens to x?

9. Next assign z=x[:] using a simple slice. Repeat the same exercise using y[0][0] = 1j. What happens to x and z? What are the ids of x, y and z? What about x[0], y[0] and z[0]?

10. How could you create w from x so that w can be changed without affecting x?

11. Initialize a list containing 4, 3.1415, 1.0, 2+4j, 'Hello', 'World'. How could you:
(a) Delete 1.0 if you knew its position? What if you didn’t know its position?
(b) How can the list [1.0, 2+4j, 'Hello'] be added to the existing list?
(c) How can the list be reversed?
(d) In the extended list, how can you count the occurrence of 'Hello'?

12. Construct a dictionary with the key-value pairs: 'alpha' and 1.0, 'beta' and 3.1415, 'gamma' and -99. How can the value of alpha be retrieved?

13. Convert the final list at the end of problem 11 to a set. How is the set different from the list?

Chapter 3

Arrays

NumPy provides the core data type for numerical analysis – arrays.
NumPy arrays are widely used throughout the Python ecosystem and are extended by other key libraries including pandas, an essential library for data analysis.

3.1 Array

Arrays are the base data type in NumPy, and are similar to lists or tuples since they both contain collections of elements. The focus of this section is on homogeneous arrays containing numeric data – that is, an array where all elements have the same numeric type (heterogeneous arrays are covered in Chapters 16 and 17). Additionally, arrays, unlike lists, are always rectangular so that all dimensions have the same number of elements.

Arrays are initialized from lists (or tuples) using array. Two-dimensional arrays are initialized using lists of lists (or tuples of tuples, or lists of tuples, etc.), and higher dimensional arrays can be initialized by further nesting lists or tuples.

>>> from numpy import array, shape
>>> x = [0.0, 1, 2, 3, 4]
>>> y = array(x)
>>> y
array([ 0., 1., 2., 3., 4.])
>>> type(y)
numpy.ndarray

Two (or higher) -dimensional arrays are initialized using nested lists.

>>> y = array([[0.0, 1, 2, 3, 4], [5, 6, 7, 8, 9]])
>>> y
array([[ 0., 1., 2., 3., 4.],
       [ 5., 6., 7., 8., 9.]])
>>> shape(y)
(2, 5)
>>> y = array([[[1,2],[3,4]],[[5,6],[7,8]]])
>>> y
array([[[1, 2],
        [3, 4]],
       [[5, 6],
        [7, 8]]])
>>> shape(y)
(2, 2, 2)

3.1.1 Array dtypes

Homogeneous arrays can contain a variety of numeric data types. The most common data type is float64 (or double), which corresponds to the Python built-in data type of float (and C/C++ double). By default, calls to array will preserve the type of the input, if possible. If an input contains all integers, it will have an integer dtype, either int32 or int64 depending on the platform (similar to the built-in data type int). If an input contains floats, or a mix of integers and floats, the array’s data type will be float64. If the input contains a mix of integers, floats and complex types, the array will be initialized to hold complex data.
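Because the default integer dtype depends on the platform, code that inspects dtypes is more portable when it tests the kind of the dtype rather than its exact name. The following is a minimal sketch, assuming NumPy is installed and imported under the conventional alias np; the use of dtype.kind and result_type is an illustration and is not taken from the text.

```python
import numpy as np

ints = np.array([0, 1, 2, 3, 4])
floats = np.array([0.0, 1, 2, 3, 4])
cplx = np.array([0.0 + 1j, 1, 2, 3, 4])

# The integer dtype may be int32 or int64, but its kind is always 'i'
print(ints.dtype, ints.dtype.kind)

# A single float promotes the whole array to float64
print(floats.dtype)

# A single complex value promotes the whole array to complex128
print(cplx.dtype)

# result_type reports the dtype NumPy would choose when combining these types
combined = np.result_type(ints.dtype, floats.dtype, cplx.dtype)
print(combined)
```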
>>> x = [0, 1, 2, 3, 4] # Integers
>>> y = array(x)
>>> y.dtype
dtype('int32') # or int64, depending on the platform
>>> x = [0.0, 1, 2, 3, 4] # 0.0 is a float
>>> y = array(x)
>>> y.dtype
dtype('float64')
>>> x = [0.0 + 1j, 1, 2, 3, 4] # (0.0 + 1j) is a complex
>>> y = array(x)
>>> y
array([ 0.+1.j, 1.+0.j, 2.+0.j, 3.+0.j, 4.+0.j])
>>> y.dtype
dtype('complex128')

NumPy attempts to find the smallest data type which can represent the data when constructing an array. It is possible to force NumPy to select a particular dtype by using the keyword argument dtype=datatype when initializing the array.

>>> from numpy import float32
>>> x = [0, 1, 2, 3, 4] # Integers
>>> y = array(x)
>>> y.dtype
dtype('int32') # or int64, depending on the platform
>>> y = array(x, dtype='float64') # String dtype
>>> y.dtype
dtype('float64')
>>> y = array(x, dtype=float32) # NumPy type dtype
>>> y.dtype
dtype('float32')

3.2 1-dimensional Arrays

A vector $x = [1\ 2\ 3\ 4\ 5]$ is entered as a 1-dimensional array using

>>> from numpy import ndim
>>> x=array([1.0, 2.0, 3.0, 4.0, 5.0])
>>> x
array([ 1., 2., 3., 4., 5.])
>>> ndim(x)
1

If an array with 2-dimensions is required, it is necessary to use a trivial nested list.

>>> x=array([[1.0,2.0,3.0,4.0,5.0]])
>>> x
array([[ 1., 2., 3., 4., 5.]])
>>> ndim(x)
2

Notice that the output representation uses nested lists ([[ ]]) to emphasize the 2-dimensional structure of the array. The column vector,
$$x = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \end{bmatrix}$$
is entered as a 2-dimensional array using a set of nested lists

>>> x = array([[1.0],[2.0],[3.0],[4.0],[5.0]])
>>> x
array([[ 1.],
       [ 2.],
       [ 3.],
       [ 4.],
       [ 5.]])

3.3 2-dimensional Arrays

Two-dimensional arrays are rows of columns, and so
$$x = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}$$
is input by entering the array one row at a time, each in a list, and then encapsulating the row lists in another list.

>>> x = array([[1.0,2.0,3.0],[4.0,5.0,6.0],[7.0,8.0,9.0]])
>>> x
array([[ 1., 2., 3.],
       [ 4., 5., 6.],
       [ 7., 8., 9.]])

3.4 Multidimensional Arrays

Higher dimensional arrays have a number of uses, for example when modeling a time-varying covariance.
Multidimensional (N-dimensional) arrays are available for N up to about 30, depending on the size of each dimension. Manually initializing higher dimension arrays is tedious and error prone, and so it is better to use functions such as zeros((2, 2, 2)) or empty((2, 2, 2)).

Matrix

Matrices are essentially a subset of arrays and behave in a virtually identical manner. The matrix class is deprecated and so should not be used. While NumPy is likely to support the matrix class for the foreseeable future, its use is discouraged. In practice, there is no good reason not to use 2-dimensional arrays. The two important differences are:

• Matrices always have 2 dimensions
• Matrices follow the rules of linear algebra for *

1- and 2-dimensional arrays can be copied to a matrix by calling matrix on an array. Alternatively, mat or asmatrix provides a faster method to coerce an array to behave like a matrix without copying any data.

>>> from numpy import asmatrix
>>> x = [0.0, 1, 2, 3, 4] # Any float makes all float
>>> y = array(x)
>>> type(y)
numpy.ndarray
>>> y * y # Element-by-element
array([ 0., 1., 4., 9., 16.])
>>> z = asmatrix(x)
>>> type(z)
numpy.matrixlib.defmatrix.matrix
>>> z * z # Error
ValueError: matrices are not aligned

3.5 Concatenation

Concatenation is the process by which one array is appended to another. Arrays can be concatenated horizontally or vertically. For example, suppose
$$x = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \text{ and } y = \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} \text{ and } z = \begin{bmatrix} x \\ y \end{bmatrix}$$
needs to be constructed. This can be accomplished by treating x and y as elements of a new array and using the function concatenate to join them. The inputs to concatenate must be grouped in a tuple and the keyword argument axis specifies whether the arrays are to be vertically (axis = 0) or horizontally (axis = 1) concatenated.
>>> x = array([[1.0,2.0],[3.0,4.0]])
>>> y = array([[5.0,6.0],[7.0,8.0]])
>>> z = concatenate((x,y),axis = 0)
>>> z
array([[ 1., 2.],
       [ 3., 4.],
       [ 5., 6.],
       [ 7., 8.]])
>>> z = concatenate((x,y),axis = 1)
>>> z
array([[ 1., 2., 5., 6.],
       [ 3., 4., 7., 8.]])

Concatenating is the code equivalent of block forms in linear algebra. Alternatively, the functions vstack and hstack can be used to vertically or horizontally stack arrays, respectively.

>>> z = vstack((x,y)) # Same as z = concatenate((x,y),axis = 0)
>>> z = hstack((x,y)) # Same as z = concatenate((x,y),axis = 1)

3.6 Accessing Elements of an Array

Four methods are available for accessing elements contained within an array: scalar selection, slicing, numerical indexing and logical (or Boolean) indexing. Scalar selection and slicing are the simplest and so are presented first. Numerical indexing and logical indexing both depend on specialized functions and so these methods are discussed in Chapter 11.

3.6.1 Scalar Selection

Pure scalar selection is the simplest method to select elements from an array, and is implemented using [i] for 1-dimensional arrays, [i, j] for 2-dimensional arrays and [i1, i2, ..., in] for general n-dimensional arrays. Like all indexing in Python, selection is 0-based so that [0] is the first element in a 1-d array, [0,0] is the upper left element in a 2-d array, and so on.

>>> x = array([1.0,2.0,3.0,4.0,5.0])
>>> x[0]
1.0
>>> x = array([[1.0,2,3],[4,5,6]])
>>> x
array([[ 1., 2., 3.],
       [ 4., 5., 6.]])
>>> x[1, 2]
6.0
>>> type(x[1,2])
numpy.float64

Pure scalar selection always returns a single element which is not an array. The data type of the selected element matches the data type of the array used in the selection. Scalar selection can also be used to assign values in an array.

>>> x = array([1.0,2.0,3.0,4.0,5.0])
>>> x[0] = -5
>>> x
array([-5., 2., 3., 4., 5.])

3.6.2 Array Slicing

Arrays, like lists and tuples, can be sliced.
Array slicing is virtually identical to list slicing except that a simpler slicing syntax is available when using multiple dimensions. Arrays are sliced using the syntax [:,:,...,:] (where the number of dimensions of the array determines the size of the slice).1 Recall that the slice notation a:b:s will select every sth element where the indices i satisfy a ≤ i < b so that the starting value a is always included in the list and the ending value b is always excluded. Additionally, a number of shorthand notations are commonly encountered:

• : and :: are the same as 0:n:1 where n is the length of the array (or list).
• a: and a:n are the same as a:n:1 where n is the length of the array (or list).
• :b is the same as 0:b:1.
• ::s is the same as 0:n:s where n is the length of the array (or list).

1 It is not necessary to include all trailing slice dimensions, and any omitted trailing slices are set to select all elements (the slice :). For example, if x is a 3-dimensional array, x[0:2] is the same as x[0:2,:,:] and x[0:2,0:2] is the same as x[0:2,0:2,:].

Basic slicing of 1-dimensional arrays is identical to slicing a simple list, and the returned type of all slicing operations matches the array being sliced.

>>> x = array([1.0,2.0,3.0,4.0,5.0])
>>> y = x[:]
>>> y
array([ 1., 2., 3., 4., 5.])
>>> y = x[:2]
>>> y
array([ 1., 2.])
>>> y = x[1::2]
>>> y
array([ 2., 4.])

In 2-dimensional arrays, the first dimension specifies the row or rows of the slice and the second dimension specifies the column or columns. Note that the 2-dimensional slice syntax y[a:b,c:d] is the same as y[a:b,:][:,c:d] or y[a:b][:,c:d], although the shorter form is preferred. In the case where only row slicing is needed, y[a:b], the equivalent of y[a:b,:], is the shortest syntax.
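The shorthand rules and the 2-dimensional equivalences can be checked directly. This is a small sketch assuming NumPy is imported as np; the arrays x and y and the bounds a, b, c and d are made up for the illustration.

```python
import numpy as np

x = np.arange(10.0)
n = len(x)

# Each shorthand selects the same elements as its explicit a:b:s form
assert np.array_equal(x[:], x[0:n:1])
assert np.array_equal(x[::], x[0:n:1])
assert np.array_equal(x[3:], x[3:n:1])
assert np.array_equal(x[:4], x[0:4:1])
assert np.array_equal(x[::3], x[0:n:3])

y = np.arange(20.0).reshape(4, 5)
a, b, c, d = 1, 3, 0, 2  # hypothetical slice bounds

# The three 2-dimensional forms select the same block
block = y[a:b, c:d]
assert np.array_equal(block, y[a:b, :][:, c:d])
assert np.array_equal(block, y[a:b][:, c:d])
print(block.shape)  # (2, 2)
```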
>>> y = array([[0.0, 1, 2, 3, 4],[5, 6, 7, 8, 9]])
>>> y
array([[ 0., 1., 2., 3., 4.],
       [ 5., 6., 7., 8., 9.]])
>>> y[:1,:] # Row 0, all columns
array([[ 0., 1., 2., 3., 4.]])
>>> y[:1] # Same as y[:1,:]
array([[ 0., 1., 2., 3., 4.]])
>>> y[:,:1] # All rows, column 0
array([[ 0.],
       [ 5.]])
>>> y[:1,0:3] # Row 0, columns 0 to 2
array([[ 0., 1., 2.]])
>>> y[:1][:,0:3] # Same as previous
array([[ 0., 1., 2.]])
>>> y[:,3:] # All rows, columns 3 and 4
array([[ 3., 4.],
       [ 8., 9.]])
>>> y = array([[[1.0,2],[3,4]],[[5,6],[7,8]]])
>>> y[:1,:,:] # Panel 0 of 3D y
array([[[ 1., 2.],
        [ 3., 4.]]])

In the previous examples, slice notation was always used even when only selecting 1 row or column. This was done to emphasize the difference between using slice notation, which always returns an array with the same number of dimensions, and using a scalar selector, which will perform dimension reduction.

3.6.3 Mixed Selection using Scalar and Slice Selectors

When arrays have more than 1 dimension, it is often useful to mix scalar and slice selectors to select an entire row, column or panel of a 3-dimensional array. This is similar to pure slicing with one important caveat – dimensions selected using scalar selectors are eliminated. For example, if x is a 2-dimensional array, then x[0,:] will select the first row. However, unlike the 2-dimensional array constructed using the slice x[:1,:], x[0,:] will be a 1-dimensional array.

>>> x = array([[1.0,2],[3,4]])
>>> x[:1,:] # Row 0, all columns, 2-dimensional
array([[ 1., 2.]])
>>> x[0,:] # Row 0, all columns, dimension reduced
array([ 1., 2.])

While these two selections appear similar, the first produces a 2-dimensional array (note the [[ ]] syntax) while the second is a 1-dimensional array. In most cases where a single row or column is required, using scalar selectors such as y[0,:] is the best practice.
It is important to be aware of the dimension reduction since scalar selections from 2-dimensional arrays will not have 2 dimensions. This type of dimension reduction may matter when evaluating linear algebra expressions. The principle adopted by NumPy is that slicing should always preserve the dimension of the underlying array, while scalar indexing should always collapse the dimension(s). This is consistent with x[0,0] returning a scalar (or 0-dimensional array) since both selections are scalar. This is demonstrated in the next example which highlights the differences between pure slicing, mixed slicing, and pure scalar selection. Note that the function ndim returns the number of dimensions of an array.

>>> x = array([[0.0, 1, 2, 3, 4],[5, 6, 7, 8, 9]])
>>> x[:1,:] # Row 0, all columns, 2-dimensional
array([[ 0., 1., 2., 3., 4.]])
>>> ndim(x[:1,:])
2
>>> x[0,:] # Row 0, all columns, dim reduction to 1-d array
array([ 0., 1., 2., 3., 4.])
>>> ndim(x[0,:])
1
>>> x[0,0] # Top left element, dim reduction to scalar (0-d array)
0.0
>>> ndim(x[0,0])
0
>>> x[:,0] # All rows, 1 column, dim reduction to 1-d array
array([ 0., 5.])

3.6.4 Assignment using Slicing

Slicing and scalar selection can be used to assign arrays that have the same dimension as the slice.2

>>> x = array([[0.0]*3]*3) # *3 repeats the list 3 times
>>> x
array([[ 0., 0., 0.],
       [ 0., 0., 0.],
       [ 0., 0., 0.]])
>>> x[0,:] = array([1.0, 2.0, 3.0])
>>> x
array([[ 1., 2., 3.],
       [ 0., 0., 0.],
       [ 0., 0., 0.]])
>>> x[::2,::2]
array([[ 1., 3.],
       [ 0., 0.]])
>>> x[::2,::2] = array([[-99.0,-99],[-99,-99]]) # 2 by 2
>>> x
array([[-99., 2., -99.],
       [ 0., 0., 0.],
       [-99., 0., -99.]])
>>> x[1,1] = pi
>>> x
array([[-99.        ,   2.        , -99.        ],
       [  0.        ,   3.14159265,   0.        ],
       [-99.        ,   0.        , -99.        ]])

NumPy attempts automatic (silent) data type conversion if an element with one data type is inserted into an array with a different type.
For example, if an array has an integer data type, placing a float into the array results in the float being truncated and stored as an integer. This is dangerous, and so in most cases, arrays should be initialized to contain floats unless a considered decision is taken to use a different data type.

>>> x = [0, 1, 2, 3, 4] # Integers
>>> y = array(x)
>>> y.dtype
dtype('int32')
>>> y[0] = 3.141592
>>> y
array([3, 1, 2, 3, 4])
>>> x = [0.0, 1, 2, 3, 4] # 1 Float makes all float
>>> y = array(x)
>>> y.dtype
dtype('float64')
>>> y[0] = 3.141592
>>> y
array([ 3.141592, 1. , 2. , 3. , 4. ])

2 Formally, the array to be assigned must be broadcastable to the size of the slice. Broadcasting is described in Chapter 4, and assignment using broadcasting is discussed in Chapter 11.

3.6.5 Linear Slicing using flat

Data in arrays is stored in row-major order – elements are indexed by first counting across rows and then down columns. For example, in the 2-dimensional array

$x = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}$

the first element of x is 1, the second element is 2, the third is 3, the fourth is 4, and so on.

In addition to slicing using the [:,:,...,:] syntax, k-dimensional arrays can be linearly sliced. Linear slicing assigns an index to each element of the array, starting with the first (0), the second (1), and so on until the final element (n − 1). In 2 dimensions, linear slicing works by first counting across rows, and then down columns. To use linear slicing, the array attribute flat must first be used.
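The row-major counting rule can be made concrete: in a 2-dimensional array with ncols columns, element [i, j] has linear index i * ncols + j. The sketch below checks this using np.ravel_multi_index and np.unravel_index, two NumPy functions that are not otherwise used in this chapter. It assumes NumPy is imported as np and that the array uses the default row-major ordering.

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])
nrows, ncols = x.shape

i, j = 1, 2  # row 1, column 2 holds the value 6
linear = i * ncols + j  # count across full rows first, then along the row
print(linear)  # 5

# NumPy's own conversions agree with the formula
assert linear == np.ravel_multi_index((i, j), x.shape)
assert np.unravel_index(linear, x.shape) == (i, j)

# The flat attribute uses the same linear index
assert x.flat[linear] == x[i, j]
```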
>>> y = reshape(arange(25.0),(5,5))
>>> y
array([[ 0., 1., 2., 3., 4.],
       [ 5., 6., 7., 8., 9.],
       [ 10., 11., 12., 13., 14.],
       [ 15., 16., 17., 18., 19.],
       [ 20., 21., 22., 23., 24.]])
>>> y[0] # Same as y[0,:], first row
array([ 0., 1., 2., 3., 4.])
>>> y.flat[0] # Scalar slice, flat is 1-dimensional
0.0
>>> y[6] # Error
IndexError: index out of bounds
>>> y.flat[6] # Element 6
6.0
>>> y.flat[12:15]
array([ 12., 13., 14.])
>>> y.flat[:] # All element slice
array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.,
       11., 12., 13., 14., 15., 16., 17., 18., 19., 20., 21.,
       22., 23., 24.])

Note that arange and reshape are useful functions that are described in later chapters.

3.7 Slicing and Memory Management

Unlike lists, slices of arrays do not copy the underlying data. Instead, a slice of an array returns a view of the array which shares the data in the sliced array. This is important since changes in slices will propagate to the original array as well as to any other slices which share the same element.

>>> x = reshape(arange(4.0),(2,2))
>>> x
array([[ 0., 1.],
       [ 2., 3.]])
>>> s1 = x[0,:] # First row
>>> s2 = x[:,0] # First column
>>> s1[0] = -3.14 # Assign first element
>>> s1
array([-3.14, 1. ])
>>> s2
array([-3.14, 2. ])
>>> x
array([[-3.14, 1. ],
       [ 2. , 3. ]])

If changes should not propagate to parent and sibling arrays, it is necessary to call copy on the slice. Alternatively, slices can also be copied by calling array on an existing array.

>>> x = reshape(arange(4.0),(2,2))
>>> s1 = copy(x[0,:]) # Function copy
>>> s2 = x[:,0].copy() # Method copy, more common
>>> s3 = array(x[0,:]) # Create a new array
>>> s1[0] = -3.14
>>> s1
array([-3.14, 1.])
>>> s2
array([ 0., 2.])
>>> s3
array([0., 1.])
>>> x
array([[ 0., 1.],
       [ 2., 3.]])

There is one notable exception to this rule – when using pure scalar selection the (scalar) value returned is always a copy.
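Whether a selection is a view or a copy can also be checked without relying on side effects. The following sketch uses np.shares_memory, a NumPy function not used elsewhere in this chapter, and assumes NumPy is imported as np; the variable names are chosen for the illustration only.

```python
import numpy as np

x = np.arange(4.0).reshape(2, 2)

row_view = x[0, :]         # slice: a view that shares data with x
row_copy = x[0, :].copy()  # explicit copy: independent data
scalar = x[0, 0]           # pure scalar selection: a copy of the value

print(np.shares_memory(x, row_view))  # True
print(np.shares_memory(x, row_copy))  # False

row_copy[0] = -3.14  # does not touch x
print(x[0, 0])       # 0.0

row_view[0] = -3.14  # propagates to x
print(x[0, 0])       # -3.14
print(scalar)        # 0.0, unaffected by the later assignment
```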
>>> x = arange(5.0)
>>> y = x[0] # Pure scalar selection
>>> z = x[:1] # A pure slice
>>> y = -3.14
>>> y # y Changes
-3.14
>>> x # No propagation
array([ 0., 1., 2., 3., 4.])
>>> z # No changes to z either
array([ 0.])
>>> z[0] = -2.79
>>> y # No propagation since y used pure scalar selection
-3.14
>>> x # z is a view of x, so changes propagate
array([-2.79, 1. , 2. , 3. , 4. ])

Finally, assignments from functions which change values will automatically create a copy of the underlying array.

>>> x = array([[0.0, 1.0],[2.0,3.0]])
>>> y = x
>>> print(id(x),id(y)) # Same id, same object
129186368 129186368
>>> y = x + 1.0
>>> y
array([[ 1., 2.],
       [ 3., 4.]])
>>> print(id(x),id(y)) # Different
129186368 129183104
>>> x # Unchanged
array([[ 0., 1.],
       [ 2., 3.]])
>>> y = exp(x)
>>> print(id(x),id(y)) # Also Different
129186368 129185120

Even a trivial operation such as y = x + 0.0 creates a copy of x, and so the only scenario where explicit copying is required is when y is directly assigned using a slice of x, and changes to y should not propagate to x.

3.8 import and Modules

Python, by default, only has access to a small number of built-in types and functions. The vast majority of functions are located in modules, and before a function can be accessed, the module which contains the function must be imported. For example, when using %pylab in an IPython session a large number of modules are automatically imported, including NumPy, SciPy, and matplotlib. While this style of importing is useful for learning and interactive use, care is needed to make sure that the correct module is imported when designing more complex programs. For example, both NumPy and SciPy have functions called sqrt and so it is not clear which will be used by Pylab.

import can be used in a variety of ways. The simplest is to use from module import * which imports all functions in module.
This method of using import can be dangerous since it is possible for functions in one module to be hidden by later imports from other modules. A better method is to only import the required functions. This still places functions at the top level of the namespace while preventing conflicts.

from pylab import log2 # Will import log2 only
from scipy import log10 # Will not import the log2 from SciPy

The functions log2 and log10 can both be called in subsequent code. An alternative and more common method is to use import in the form

import pylab
import scipy
import numpy

which allows functions to be accessed using dot-notation and the module name, for example scipy.log2. It is also possible to rename modules when imported using as

import pylab as pl
import scipy as sp
import numpy as np

The only difference between the two types is that import scipy is implicitly calling import scipy as scipy. When this form of import is used, functions are used with the “as” name. For example, the square root provided by SciPy is accessed using sp.sqrt, while the pylab square root is pl.sqrt. Using this form of import allows both to be used where appropriate.

3.9 Calling Functions

Function calls have different conventions than most other expressions. The most important difference is that functions can take more than one input and return more than one output. The generic structure of a function call is out1, out2, out3,... = functionname(in1, in2, in3,...). The important aspects of this structure are

• If multiple outputs are returned, but only one output variable is provided, the output will (generally) be a tuple.
• If more than one output variable is given in a function call, the number of output variables must match the number of outputs provided by the function. It is not possible to ask for two outputs if a function returns three – using an incorrect number of outputs results in ValueError: too many values to unpack.
• Both inputs and outputs must be separated by commas (,)
• Inputs can be the result of other functions. For example, the following are equivalent,

>>> y = var(x)
>>> mean(y)

and

>>> mean(var(x))

Required Arguments

Most functions have required arguments. For example, consider the definition of array from help(array),

array(object, dtype=None, copy=True, order=None, subok=False, ndmin=0)

array has 1 required input, object, which is the list or tuple which contains values to use when creating the array. Required arguments can be determined by inspecting the function signature since all of the inputs follow the pattern keyword=default except object – required arguments will not have a default value provided. The other arguments can be called in order (array accepts at most 2 non-keyword arguments).

>>> array([[1.0,2.0],[3.0,4.0]])
array([[ 1., 2.],
       [ 3., 4.]])
>>> array([[1.0,2.0],[3.0,4.0]], 'int32')
array([[1, 2],
       [3, 4]])

Keyword Arguments

All of the arguments to array can be called by the keyword that appears in the help file definition.

>>> array(object=[[1.0,2.0],[3.0,4.0]])
>>> array([[1.0,2.0],[3.0,4.0]], dtype=None, copy=True, order=None, subok=False)

Keyword arguments have two important advantages. First, they do not have to appear in any order (Note: randomly ordering arguments is not good practice, and this is only an example), and second, keyword arguments can be used only when needed since a default value is always given.

>>> array(dtype='complex64', object = [[1.0,2.0],[3.0,4.0]], copy=True)
array([[ 1.+0.j, 2.+0.j],
       [ 3.+0.j, 4.+0.j]], dtype=complex64)

Default Arguments

Functions have defaults for optional arguments. These are listed in the function definition and appear in the help in the form keyword=default. Returning to array, all inputs have default arguments except object – the only required input.

Multiple Outputs

Some functions can have more than 1 output. These functions can be used in a single output mode or in multiple output mode.
For example, shape can be used on an array to determine the size of each dimension.

>>> x = array([[1.0,2.0],[3.0,4.0]])
>>> s = shape(x)
>>> s
(2, 2)

Since shape will return as many outputs as there are dimensions, it can be called with 2 outputs when the input is a 2-dimensional array.

>>> x = array([[1.0,2.0],[3.0,4.0]])
>>> M, N = shape(x)
>>> M
2
>>> N
2

Requesting more outputs than are available will produce an error.

>>> M, N, P = shape(x) # Error
ValueError: not enough values to unpack (expected 3, got 2)

Similarly, providing too few outputs can also produce an error. Consider the case where the argument used with shape is a 3-dimensional array.

>>> x = randn(10,10,10)
>>> shape(x)
(10, 10, 10)
>>> M, N = shape(x) # Error
ValueError: too many values to unpack (expected 2)

3.10 Exercises

1. Input the following mathematical expressions into Python as arrays.

$u = \begin{bmatrix} 1 & 1 & 2 & 3 & 5 & 8 \end{bmatrix}$

$v = \begin{bmatrix} 1 \\ 1 \\ 2 \\ 3 \\ 5 \\ 8 \end{bmatrix}$

$x = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$

$y = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$

$z = \begin{bmatrix} 1 & 2 & 1 & 2 \\ 3 & 4 & 3 & 4 \\ 1 & 2 & 1 & 2 \end{bmatrix}$

x x w=