Module 1 PDF - Basic Data Science Terminologies
Summary
This document provides a basic overview of terminologies related to data science. It covers concepts like data science, structured data, unstructured data, semi-structured data, and data streams.
# Basic Terminologies of Data Science

Data Science is a multi-disciplinary field whose objective is to analyze data and generate knowledge that can be used for decision making. This knowledge can take the form of discovered patterns, predictive models, forecasting models, and so on. A data science application collects data and information from multiple heterogeneous sources; cleans, integrates, processes, and analyzes this data using various tools; and presents the resulting information and knowledge in visual form.

1. Data Science: An interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract insights and knowledge from structured and unstructured data.
2. Data: Raw facts and figures that are collected and stored. Data can be in the form of text, numbers, images, audio, or any other format.
3. Big Data: Datasets that are too large and complex for traditional data processing applications. Working with big data involves processing vast amounts of data using parallel and distributed computing.
4. Data Mining: The process of discovering patterns and knowledge in large amounts of data, using techniques such as clustering, classification, and regression.
5. Machine Learning: A subset of artificial intelligence focused on developing algorithms and statistical models that enable computers to perform tasks without explicit programming.
6. Artificial Intelligence (AI): A broader concept referring to computer systems capable of performing tasks that normally require human intelligence. Machine Learning is a subset of AI.
7. Algorithm: A step-by-step procedure or formula for solving a problem. In data science, algorithms are used to process and analyze data and to make predictions or decisions.
8. Feature: An individual measurable property or characteristic of a phenomenon being observed. In machine learning, features are the variables used to make predictions.
9. Model: A representation of a system or process used to make predictions or gain insights from data. In machine learning, models are trained on data and then make predictions on new, unseen data.
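To make the last three terms concrete, here is a minimal sketch, using an invented toy dataset, of how data, a feature, and a model relate: a single feature is fit against a target by least squares, and the fitted model then predicts on new, unseen data.

```python
import numpy as np

# Toy, invented data: one feature (hours studied) and a target (exam score).
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # feature values
scores = np.array([52.0, 57.0, 66.0, 68.0, 75.0])  # observed targets

# "Training" a simple model: fit score = a * hours + b by least squares.
a, b = np.polyfit(hours, scores, deg=1)

# Using the trained model to predict on new, unseen data.
print(a * 6.0 + b)  # predicted score for 6 hours of study
```

Real projects involve many features and careful validation, but the feature, model, and prediction flow is the same.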
# Types of Data in Data Science

Data in data science is commonly grouped into structured data, semi-structured data, unstructured data, and data streams.

1. Structured Data
   Definition: Structured data is highly organized and follows a well-defined schema. It fits neatly into relational databases or tables and is characterized by rows and columns.
   Examples: SQL databases, Excel spreadsheets, CSV files.
   Properties: Clear schema. Easily queryable. Well suited to traditional databases.

2. Unstructured Data
   Definition: Unstructured data lacks a predefined data model or structure. It does not conform to a tabular format and can include text, images, audio, and video.
   Examples: Text documents, social media posts, images, videos.
   Properties: No clear structure. Requires advanced processing techniques (NLP, computer vision). Abundant in real-world scenarios.

3. Semi-Structured Data
   Definition: Semi-structured data has some organizational properties of structured data but does not fit neatly into a relational database. It may have tags, keys, or hierarchical structures.
   Examples: JSON (JavaScript Object Notation), XML (eXtensible Markup Language).
   Properties: Some level of structure. Flexible and adaptable. Common in web applications and APIs.

4. Data Streams
   Definition: Data streams are continuous, real-time data generated and processed without interruption. They are often used for monitoring, analytics, and decision making in real time.
   Examples: Sensor data, financial market data, social media updates.
   Properties: Constant flow of data. Time-sensitive. Requires efficient processing and analysis.

Key Considerations:

- Use Cases: Structured data suits scenarios with a clear schema and well-defined relationships between data points. Unstructured data is prevalent in natural language processing, image recognition, and multimedia analysis. Semi-structured data is flexible and commonly used in web development and data interchange.
- Processing Techniques: Structured data can be processed with traditional SQL queries. Unstructured data often requires advanced techniques such as machine learning algorithms and computer vision. Semi-structured data may be handled with document-based or NoSQL databases.
- Challenges: Handling the volume and complexity of structured data can be difficult. Extracting meaningful information from unstructured data requires sophisticated algorithms. Managing the flexibility of semi-structured data can be complex.
- Real-Time Analytics: Data streams are essential for real-time analytics, enabling quick decisions based on up-to-the-moment information. Structured, unstructured, and semi-structured data may also be processed in real time, but data streams are specifically designed for continuous, high-velocity data.

Understanding these data types is crucial for data scientists, as different types require different approaches to processing, analysis, and the extraction of valuable insights. A short sketch contrasting these types follows.
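Here is the sketch promised above: a minimal, illustrative contrast of the data types in Python. The sample names, values, and the tiny generator standing in for a live feed are all invented for illustration.

```python
import csv
import io
import json

# Structured: fixed schema of rows and columns (a CSV shown inline for brevity).
csv_text = "name,age\nAsha,34\nBob,29\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))  # each row becomes a dict
print(rows[0]["age"])

# Semi-structured: tagged and hierarchical, with a flexible schema (JSON).
record = json.loads('{"name": "Asha", "skills": ["SQL", "Python"]}')
print(record["skills"][0])                          # navigate the hierarchy

# Data stream: a continuous sequence consumed one item at a time.
def sensor_stream(readings):
    for reading in readings:                        # stand-in for a live feed
        yield reading

for value in sensor_stream([21.5, 21.7, 21.6]):
    print(value)                                    # process each reading as it arrives
```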
# Five Steps of Data Science

1. Asking an Interesting Question
   Purpose: Data science starts with formulating a clear, relevant question or problem to solve. This step guides the entire data science process.
   Key Actions: Identify the problem or goal. Formulate specific, answerable questions. Define success criteria.

2. Obtaining the Data
   Purpose: Acquire relevant and sufficient data to address the questions posed in the first step, by collecting, accessing, or retrieving the necessary datasets.
   Key Actions: Identify data sources. Collect or access the data. Ensure data quality and reliability.

3. Exploring the Data
   Purpose: Understand the characteristics and patterns within the data. Exploratory Data Analysis (EDA) is conducted to gain insights and inform subsequent steps.
   Key Actions: Descriptive statistics. Data visualization. Identify outliers and missing values.

4. Modeling the Data
   Purpose: Build a predictive model or statistical analysis to answer the questions posed initially, selecting appropriate algorithms and techniques.
   Key Actions: Feature engineering. Selecting and training models. Evaluating model performance.

5. Communicating and Visualizing the Results
   Purpose: Effectively communicate findings and insights to both technical and non-technical stakeholders. Visualization plays a key role in conveying complex information.
   Key Actions: Prepare a clear and concise report. Create visualizations to support key findings. Communicate results to stakeholders.

Key Considerations:

- Iteration: The data science process is often iterative. As new insights emerge or additional questions arise, the process may cycle back to earlier steps.
- Interdisciplinary Collaboration: Effective data science often involves collaboration between data scientists, domain experts, and stakeholders to ensure the relevance and accuracy of results.
- Ethical Considerations: Throughout the process, ethical issues should be addressed, including data privacy, bias, and the responsible use of algorithms.
- Tools and Technologies: Various tools and programming languages (such as Python, R, and SQL) are used at different stages of the process.
- Validation and Reproducibility: Rigorous validation of results, and ensuring that the analysis can be reproduced, are critical to maintaining the integrity of the data science workflow.

The five steps provide a systematic approach for turning raw data into actionable insights. Each step contributes to extracting knowledge and value from data, facilitating informed decision making in various domains. A sketch mapping the steps onto code appears below.
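The following sketch maps the five steps onto a short pandas workflow. It is illustrative only: the question, the column names, and the toy inline table are hypothetical, and a simple correlation stands in for real model building in step 4.

```python
import pandas as pd

# Step 1: ask a question, e.g. "Do larger orders take longer to ship?" (hypothetical)

# Step 2: obtain the data; a toy inline table stands in for a real source here.
df = pd.DataFrame({"order_size": [1, 3, 5, 8, 2, 9],
                   "shipping_days": [2, 2, 4, 5, 3, 6]})

# Step 3: explore with descriptive statistics and a missing-value check.
print(df.describe())
print(df.isna().sum())

# Step 4: model; a simple correlation stands in for selecting and training a model.
print(df["order_size"].corr(df["shipping_days"]))

# Step 5: communicate; a scatter plot of the relationship (requires matplotlib).
df.plot.scatter(x="order_size", y="shipping_days")
```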
Python for Data Analysis, 3E
4 NumPy Basics: Arrays and Vectorized Computation

NumPy, short for Numerical Python, is one of the most important foundational packages for numerical computing in Python. Many computational packages providing scientific functionality use NumPy's array objects as one of the standard interface lingua francas for data exchange. Much of the knowledge about NumPy that I cover is transferable to pandas as well. Here are some of the things you'll find in NumPy:

- ndarray, an efficient multidimensional array providing fast array-oriented arithmetic operations and flexible broadcasting capabilities
- Mathematical functions for fast operations on entire arrays of data without having to write loops
- Tools for reading/writing array data to disk and working with memory-mapped files
- Linear algebra, random number generation, and Fourier transform capabilities
- A C API for connecting NumPy with libraries written in C, C++, or FORTRAN

Because NumPy provides a comprehensive and well-documented C API, it is straightforward to pass data to external libraries written in a low-level language, and for external libraries to return data to Python as NumPy arrays. This feature has made Python a language of choice for wrapping legacy C, C++, or FORTRAN codebases and giving them a dynamic and accessible interface.

While NumPy by itself does not provide modeling or scientific functionality, having an understanding of NumPy arrays and array-oriented computing will help you use tools with array computing semantics, like pandas, much more effectively. Since NumPy is a large topic, I will cover many advanced NumPy features like broadcasting in more depth later (see Appendix A: Advanced NumPy). Many of these advanced features are not needed to follow the rest of this book, but they may help you as you go deeper into scientific computing in Python.

For most data analysis applications, the main areas of functionality I'll focus on are:

- Fast array-based operations for data munging and cleaning, subsetting and filtering, transformation, and any other kind of computation
- Common array algorithms like sorting, unique, and set operations
- Efficient descriptive statistics and aggregating/summarizing data
- Data alignment and relational data manipulations for merging and joining heterogeneous datasets
- Expressing conditional logic as array expressions instead of loops with if-elif-else branches
- Group-wise data manipulations (aggregation, transformation, and function application)

While NumPy provides a computational foundation for general numerical data processing, many readers will want to use pandas as the basis for most kinds of statistics or analytics, especially on tabular data. Also, pandas provides some more domain-specific functionality like time series manipulation, which is not present in NumPy.

Note: Array-oriented computing in Python traces its roots back to 1995, when Jim Hugunin created the Numeric library. Over the next 10 years, many scientific programming communities began doing array programming in Python, but the library ecosystem had become fragmented in the early 2000s. In 2005, Travis Oliphant was able to forge the NumPy project from the then Numeric and Numarray projects to bring the community together around a single array computing framework.

One of the reasons NumPy is so important for numerical computations in Python is because it is designed for efficiency on large arrays of data. There are a number of reasons for this:

- NumPy internally stores data in a contiguous block of memory, independent of other built-in Python objects. NumPy's library of algorithms written in the C language can operate on this memory without any type checking or other overhead. NumPy arrays also use much less memory than built-in Python sequences.
- NumPy operations perform complex computations on entire arrays without the need for Python for loops, which can be slow for large sequences. NumPy is faster than regular Python code because its C-based algorithms avoid overhead present with regular interpreted Python code.

To give you an idea of the performance difference, consider a NumPy array of one million integers, and the equivalent Python list:

```python
In : import numpy as np
In : my_arr = np.arange(1_000_000)
In : my_list = list(range(1_000_000))
```

Now let's multiply each sequence by 2:

```python
In : %timeit my_arr2 = my_arr * 2
309 us +- 7.48 us per loop (mean +- std. dev. of 7 runs, 1000 loops each)

In : %timeit my_list2 = [x * 2 for x in my_list]
46.4 ms +- 526 us per loop (mean +- std. dev. of 7 runs, 10 loops each)
```

NumPy-based algorithms are generally 10 to 100 times faster (or more) than their pure Python counterparts and use significantly less memory.
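The timing above shows the speed side of this claim. To get a rough sense of the memory side as well, here is a back-of-the-envelope sketch (an approximation, not an exact accounting of interpreter internals) comparing the raw data footprint of the array with an estimate for the equivalent list:

```python
import sys
import numpy as np

arr = np.arange(1_000_000)
lst = list(range(1_000_000))

# Raw array data: 8 bytes per element for the default integer dtype
# on most platforms, so about 8,000,000 bytes in total.
print(arr.nbytes)

# Rough list footprint: the list object itself plus one int object per element.
print(sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst))
```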
4.1 The NumPy ndarray: A Multidimensional Array Object

One of the key features of NumPy is its N-dimensional array object, or ndarray, which is a fast, flexible container for large datasets in Python. Arrays enable you to perform mathematical operations on whole blocks of data using similar syntax to the equivalent operations between scalar elements. To give you a flavor of how NumPy enables batch computations with similar syntax to scalar values on built-in Python objects, I first import NumPy and create a small array:

```python
In : import numpy as np
In : data = np.array([[1.5, -0.1, 3], [0, -3, 6.5]])
In : data
Out:
array([[ 1.5, -0.1,  3. ],
       [ 0. , -3. ,  6.5]])
```

I then write mathematical operations with data:

```python
In : data * 10
Out:
array([[ 15.,  -1.,  30.],
       [  0., -30.,  65.]])

In : data + data
Out:
array([[ 3. , -0.2,  6. ],
       [ 0. , -6. , 13. ]])
```

In the first example, all of the elements have been multiplied by 10. In the second, the corresponding values in each "cell" in the array have been added to each other.

Note: In this chapter and throughout the book, I use the standard NumPy convention of always using import numpy as np. It would be possible to put from numpy import * in your code to avoid having to write np., but I advise against making a habit of this. The numpy namespace is large and contains a number of functions whose names conflict with built-in Python functions (like min and max). Following standard conventions like these is almost always a good idea.

An ndarray is a generic multidimensional container for homogeneous data; that is, all of the elements must be the same type. Every array has a shape, a tuple indicating the size of each dimension, and a dtype, an object describing the data type of the array:

```python
In : data.shape
Out: (2, 3)
In : data.dtype
Out: dtype('float64')
```

This chapter will introduce you to the basics of using NumPy arrays, and it should be sufficient for following along with the rest of the book. While it's not necessary to have a deep understanding of NumPy for many data analytical applications, becoming proficient in array-oriented programming and thinking is a key step along the way to becoming a scientific Python guru.

Note: Whenever you see "array," "NumPy array," or "ndarray" in the book text, in most cases they all refer to the ndarray object.

Creating ndarrays

The easiest way to create an array is to use the array function. This accepts any sequence-like object (including other arrays) and produces a new NumPy array containing the passed data. For example, a list is a good candidate for conversion:

```python
In : data1 = [6, 7.5, 8, 0, 1]
In : arr1 = np.array(data1)
In : arr1
Out: array([6. , 7.5, 8. , 0. , 1. ])
```

Nested sequences, like a list of equal-length lists, will be converted into a multidimensional array:

```python
In : data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
In : arr2 = np.array(data2)
In : arr2
Out:
array([[1, 2, 3, 4],
       [5, 6, 7, 8]])
```

Since data2 was a list of lists, the NumPy array arr2 has two dimensions, with shape inferred from the data. We can confirm this by inspecting the ndim and shape attributes:

```python
In : arr2.ndim
Out: 2
In : arr2.shape
Out: (2, 4)
```

Unless explicitly specified (discussed in Data Types for ndarrays), numpy.array tries to infer a good data type for the array that it creates. The data type is stored in a special dtype metadata object; for example, in the previous two examples we have:

```python
In : arr1.dtype
Out: dtype('float64')
In : arr2.dtype
Out: dtype('int64')
```

In addition to numpy.array, there are a number of other functions for creating new arrays. As examples, numpy.zeros and numpy.ones create arrays of 0s or 1s, respectively, with a given length or shape. numpy.empty creates an array without initializing its values to any particular value. To create a higher dimensional array with these methods, pass a tuple for the shape:

```python
In : np.zeros(10)
Out: array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In : np.zeros((3, 6))
Out:
array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]])

In : np.empty((2, 3, 2))
Out:
array([[[0., 0.],
        [0., 0.],
        [0., 0.]],
       [[0., 0.],
        [0., 0.],
        [0., 0.]]])
```

Caution: It's not safe to assume that numpy.empty will return an array of all zeros. This function returns uninitialized memory and thus may contain nonzero "garbage" values. You should use this function only if you intend to populate the new array with data.
numpy.arange is an array-valued version of the built-in Python range function:

```python
In : np.arange(15)
Out: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
```

See Table 4.1 for a short list of standard array creation functions. Since NumPy is focused on numerical computing, the data type, if not specified, will in many cases be float64 (floating point).

Table 4.1: Some important NumPy array creation functions

| Function | Description |
| --- | --- |
| array | Convert input data (list, tuple, array, or other sequence type) to an ndarray either by inferring a data type or explicitly specifying a data type; copies the input data by default |
| asarray | Convert input to ndarray, but do not copy if the input is already an ndarray |
| arange | Like the built-in range but returns an ndarray instead of a list |
| ones, ones_like | Produce an array of all 1s with the given shape and data type; ones_like takes another array and produces a ones array of the same shape and data type |
| zeros, zeros_like | Like ones and ones_like but producing arrays of 0s instead |
| empty, empty_like | Create new arrays by allocating new memory, but do not populate with any values like ones and zeros |
| full, full_like | Produce an array of the given shape and data type with all values set to the indicated "fill value"; full_like takes another array and produces a filled array of the same shape and data type |
| eye, identity | Create a square N × N identity matrix (1s on the diagonal and 0s elsewhere) |
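Not every function in Table 4.1 is demonstrated in this excerpt; as a quick illustrative sketch (not from the original chapter) of full, full_like, and eye:

```python
import numpy as np

print(np.full((2, 3), 7.5))           # 2 x 3 array where every value is 7.5
print(np.full_like(np.zeros(4), 9))   # same shape and dtype as the input, filled with 9
print(np.eye(3))                      # 3 x 3 identity matrix (1s on the diagonal)
```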
Data Types for ndarrays

The data type or dtype is a special object containing the information (or metadata, data about data) the ndarray needs to interpret a chunk of memory as a particular type of data:

```python
In : arr1 = np.array([1, 2, 3], dtype=np.float64)
In : arr2 = np.array([1, 2, 3], dtype=np.int32)
In : arr1.dtype
Out: dtype('float64')
In : arr2.dtype
Out: dtype('int32')
```

Data types are a source of NumPy's flexibility for interacting with data coming from other systems. In most cases they provide a mapping directly onto an underlying disk or memory representation, which makes it possible to read and write binary streams of data to disk and to connect to code written in a low-level language like C or FORTRAN. The numerical data types are named the same way: a type name, like float or int, followed by a number indicating the number of bits per element. A standard double-precision floating-point value (what's used under the hood in Python's float object) takes up 8 bytes or 64 bits. Thus, this type is known in NumPy as float64. See Table 4.2 for a full listing of NumPy's supported data types.

Note: Don't worry about memorizing the NumPy data types, especially if you're a new user. It's often only necessary to care about the general kind of data you're dealing with, whether floating point, complex, integer, Boolean, string, or general Python object. When you need more control over how data is stored in memory and on disk, especially large datasets, it is good to know that you have control over the storage type.

Table 4.2: NumPy data types

| Type | Type code | Description |
| --- | --- | --- |
| int8, uint8 | i1, u1 | Signed and unsigned 8-bit (1 byte) integer types |
| int16, uint16 | i2, u2 | Signed and unsigned 16-bit integer types |
| int32, uint32 | i4, u4 | Signed and unsigned 32-bit integer types |
| int64, uint64 | i8, u8 | Signed and unsigned 64-bit integer types |
| float16 | f2 | Half-precision floating point |
| float32 | f4 or f | Standard single-precision floating point; compatible with C float |
| float64 | f8 or d | Standard double-precision floating point; compatible with C double and Python float object |
| float128 | f16 or g | Extended-precision floating point |
| complex64, complex128, complex256 | c8, c16, c32 | Complex numbers represented by two 32, 64, or 128 floats, respectively |
| bool | ? | Boolean type storing True and False values |
| object | O | Python object type; a value can be any Python object |
| string_ | S | Fixed-length ASCII string type (1 byte per character); for example, to create a string data type with length 10, use 'S10' |
| unicode_ | U | Fixed-length Unicode type (number of bytes platform specific); same specification semantics as string_ (e.g., 'U10') |

Note: There are both signed and unsigned integer types, and many readers will not be familiar with this terminology. A signed integer can represent both positive and negative integers, while an unsigned integer can only represent nonnegative integers. For example, int8 (signed 8-bit integer) can represent integers from -128 to 127 (inclusive), while uint8 (unsigned 8-bit integer) can represent 0 through 255.

You can explicitly convert or cast an array from one data type to another using ndarray's astype method:

```python
In : arr = np.array([1, 2, 3, 4, 5])
In : arr.dtype
Out: dtype('int64')
In : float_arr = arr.astype(np.float64)
In : float_arr
Out: array([1., 2., 3., 4., 5.])
In : float_arr.dtype
Out: dtype('float64')
```

In this example, integers were cast to floating point. If I cast some floating-point numbers to be of integer data type, the decimal part will be truncated:

```python
In : arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])
In : arr
Out: array([ 3.7, -1.2, -2.6,  0.5, 12.9, 10.1])
In : arr.astype(np.int32)
Out: array([ 3, -1, -2,  0, 12, 10], dtype=int32)
```

If you have an array of strings representing numbers, you can use astype to convert them to numeric form:

```python
In : numeric_strings = np.array(["1.25", "-9.6", "42"], dtype=np.string_)
In : numeric_strings.astype(float)
Out: array([ 1.25, -9.6 , 42.  ])
```

Caution: Be cautious when using the numpy.string_ type, as string data in NumPy is fixed size and may truncate input without warning. pandas has more intuitive out-of-the-box behavior on non-numeric data.

If casting were to fail for some reason (like a string that cannot be converted to float64), a ValueError will be raised. Before, I was a bit lazy and wrote float instead of np.float64; NumPy aliases the Python types to its own equivalent data types.

You can also use another array's dtype attribute:

```python
In : int_array = np.arange(10)
In : calibers = np.array([.22, .270, .357, .380, .44, .50], dtype=np.float64)
In : int_array.astype(calibers.dtype)
Out: array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
```

There are shorthand type code strings you can also use to refer to a dtype:

```python
In : zeros_uint32 = np.zeros(8, dtype="u4")
In : zeros_uint32
Out: array([0, 0, 0, 0, 0, 0, 0, 0], dtype=uint32)
```

Note: Calling astype always creates a new array (a copy of the data), even if the new data type is the same as the old data type.
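A quick way to see the copying behavior described in the note above (an illustrative sketch, not from the original chapter): modify the result of astype and observe that the source array is untouched.

```python
import numpy as np

arr = np.array([1, 2, 3])
same = arr.astype(arr.dtype)  # same dtype, but astype still returns a new array
same[0] = 99
print(arr)                    # [1 2 3]  (the original is unchanged)
```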
Arithmetic with NumPy Arrays

Arrays are important because they enable you to express batch operations on data without writing any for loops. NumPy users call this vectorization. Any arithmetic operations between equal-size arrays apply the operation element-wise:

```python
In : arr = np.array([[1., 2., 3.], [4., 5., 6.]])
In : arr
Out:
array([[1., 2., 3.],
       [4., 5., 6.]])

In : arr * arr
Out:
array([[ 1.,  4.,  9.],
       [16., 25., 36.]])

In : arr - arr
Out:
array([[0., 0., 0.],
       [0., 0., 0.]])
```

Arithmetic operations with scalars propagate the scalar argument to each element in the array:

```python
In : 1 / arr
Out:
array([[1.    , 0.5   , 0.3333],
       [0.25  , 0.2   , 0.1667]])

In : arr ** 2
Out:
array([[ 1.,  4.,  9.],
       [16., 25., 36.]])
```

Comparisons between arrays of the same size yield Boolean arrays:

```python
In : arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])
In : arr2
Out:
array([[ 0.,  4.,  1.],
       [ 7.,  2., 12.]])

In : arr2 > arr
Out:
array([[False,  True, False],
       [ True, False,  True]])
```

Evaluating operations between differently sized arrays is called broadcasting and will be discussed in more detail in Appendix A: Advanced NumPy. Having a deep understanding of broadcasting is not necessary for most of this book.
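As a small taste of broadcasting ahead of Appendix A (an illustrative sketch, not from the original chapter): arrays of compatible but different shapes are stretched to match each other.

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])

offsets = np.array([10., 20., 30.])    # shape (3,) broadcasts across each row
print(arr + offsets)

col = arr.mean(axis=1).reshape(2, 1)   # shape (2, 1) broadcasts across each column
print(arr - col)                       # demeans each row
```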
Basic Indexing and Slicing

NumPy array indexing is a deep topic, as there are many ways you may want to select a subset of your data or individual elements. One-dimensional arrays are simple; on the surface they act similarly to Python lists:

```python
In : arr = np.arange(10)
In : arr
Out: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In : arr[5]
Out: 5
In : arr[5:8]
Out: array([5, 6, 7])
In : arr[5:8] = 12
In : arr
Out: array([ 0,  1,  2,  3,  4, 12, 12, 12,  8,  9])
```

As you can see, if you assign a scalar value to a slice, as in arr[5:8] = 12, the value is propagated (or broadcast henceforth) to the entire selection.

Note: An important first distinction from Python's built-in lists is that array slices are views on the original array. This means that the data is not copied, and any modifications to the view will be reflected in the source array.

To give an example of this, I first create a slice of arr:

```python
In : arr_slice = arr[5:8]
In : arr_slice
Out: array([12, 12, 12])
```

Now, when I change values in arr_slice, the mutations are reflected in the original array arr:

```python
In : arr_slice[1] = 12345
In : arr
Out: array([    0,     1,     2,     3,     4,    12, 12345,    12,     8,     9])
```

The "bare" slice [:] will assign to all values in an array:

```python
In : arr_slice[:] = 64
In : arr
Out: array([ 0,  1,  2,  3,  4, 64, 64, 64,  8,  9])
```

If you are new to NumPy, you might be surprised by this, especially if you have used other array programming languages that copy data more eagerly. As NumPy has been designed to be able to work with very large arrays, you could imagine performance and memory problems if NumPy insisted on always copying data.

Caution: If you want a copy of a slice of an ndarray instead of a view, you will need to explicitly copy the array, for example arr[5:8].copy(). As you will see, pandas works this way, too.

With higher dimensional arrays, you have many more options. In a two-dimensional array, the elements at each index are no longer scalars but rather one-dimensional arrays:

```python
In : arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
In : arr2d[2]
Out: array([7, 8, 9])
```

Thus, individual elements can be accessed recursively. But that is a bit too much work, so you can pass a comma-separated list of indices to select individual elements. So these are equivalent:

```python
In : arr2d[0][2]
Out: 3
In : arr2d[0, 2]
Out: 3
```

See Figure 4.1 for an illustration of indexing on a two-dimensional array. I find it helpful to think of axis 0 as the "rows" of the array and axis 1 as the "columns."

Figure 4.1: Indexing elements in a NumPy array

In multidimensional arrays, if you omit later indices, the returned object will be a lower dimensional ndarray consisting of all the data along the higher dimensions. So in the 2 × 2 × 3 array arr3d:

```python
In : arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
In : arr3d
Out:
array([[[ 1,  2,  3],
        [ 4,  5,  6]],
       [[ 7,  8,  9],
        [10, 11, 12]]])
```

arr3d[0] is a 2 × 3 array:

```python
In : arr3d[0]
Out:
array([[1, 2, 3],
       [4, 5, 6]])
```

Both scalar values and arrays can be assigned to arr3d[0]:

```python
In : old_values = arr3d[0].copy()
In : arr3d[0] = 42
In : arr3d
Out:
array([[[42, 42, 42],
        [42, 42, 42]],
       [[ 7,  8,  9],
        [10, 11, 12]]])

In : arr3d[0] = old_values
In : arr3d
Out:
array([[[ 1,  2,  3],
        [ 4,  5,  6]],
       [[ 7,  8,  9],
        [10, 11, 12]]])
```

Similarly, arr3d[1, 0] gives you all of the values whose indices start with (1, 0), forming a one-dimensional array:

```python
In : arr3d[1, 0]
Out: array([7, 8, 9])
```

This expression is the same as though we had indexed in two steps:

```python
In : x = arr3d[1]
In : x
Out:
array([[ 7,  8,  9],
       [10, 11, 12]])
In : x[0]
Out: array([7, 8, 9])
```

Note that in all of these cases where subsections of the array have been selected, the returned arrays are views.

Caution: This multidimensional indexing syntax for NumPy arrays will not work with regular Python objects, such as lists of lists.

Indexing with slices

Like one-dimensional objects such as Python lists, ndarrays can be sliced with the familiar syntax:

```python
In : arr
Out: array([ 0,  1,  2,  3,  4, 64, 64, 64,  8,  9])
In : arr[1:6]
Out: array([ 1,  2,  3,  4, 64])
```

Consider the two-dimensional array from before, arr2d. Slicing this array is a bit different:

```python
In : arr2d
Out:
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In : arr2d[:2]
Out:
array([[1, 2, 3],
       [4, 5, 6]])
```

As you can see, it has sliced along axis 0, the first axis. A slice, therefore, selects a range of elements along an axis. It can be helpful to read the expression arr2d[:2] as "select the first two rows of arr2d."

You can pass multiple slices just like you can pass multiple indexes:

```python
In : arr2d[:2, 1:]
Out:
array([[2, 3],
       [5, 6]])
```

When slicing like this, you always obtain array views of the same number of dimensions. By mixing integer indexes and slices, you get lower dimensional slices. For example, I can select the second row but only the first two columns, like so:

```python
In : lower_dim_slice = arr2d[1, :2]
```

Here, while arr2d is two-dimensional, lower_dim_slice is one-dimensional, and its shape is a tuple with one axis size:

```python
In : lower_dim_slice.shape
Out: (2,)
```

Similarly, I can select the third column but only the first two rows, like so:

```python
In : arr2d[:2, 2]
Out: array([3, 6])
```

See Figure 4.2 for an illustration.
Note that a colon by itself means to take the entire axis, so you can slice only higher dimensional axes by doing:

```python
In : arr2d[:, :1]
Out:
array([[1],
       [4],
       [7]])
```

Of course, assigning to a slice expression assigns to the whole selection:

```python
In : arr2d[:2, 1:] = 0
In : arr2d
Out:
array([[1, 0, 0],
       [4, 0, 0],
       [7, 8, 9]])
```

Figure 4.2: Two-dimensional array slicing

Boolean Indexing

Let's consider an example where we have some data in an array and an array of names with duplicates:

```python
In : names = np.array(["Bob", "Joe", "Will", "Bob", "Will", "Joe", "Joe"])
In : data = np.array([[4, 7], [0, 2], [-5, 6], [0, 0], [1, 2],
...:                  [-12, -4], [3, 4]])
In : names
Out: array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'], dtype='<U4')
```

[...]

4.7 Random Walks

In this section the chapter simulates a random walk of 1,000 equal-probability +1/-1 steps, with the cumulative walk stored in an array walk built using a NumPy random generator rng (e.g., rng = np.random.default_rng()). One statistic of interest is the first crossing time: the first step at which the walk reaches 10 or -10 in either direction. Since True is the maximum value in a Boolean array, argmax returns the index of the first True:

```python
In : (np.abs(walk) >= 10).argmax()
Out: 155
```

Note that using argmax here is not always efficient because it always makes a full scan of the array. In this special case, once a True is observed we know it to be the maximum value.

Simulating Many Random Walks at Once

If your goal was to simulate many random walks, say five thousand of them, you can generate all of the random walks with minor modifications to the preceding code. If passed a 2-tuple, the numpy.random functions will generate a two-dimensional array of draws, and we can compute the cumulative sum for each row to compute all five thousand random walks in one shot:

```python
In : nwalks = 5000
In : nsteps = 1000
In : draws = rng.integers(0, 2, size=(nwalks, nsteps))  # 0 or 1
In : steps = np.where(draws > 0, 1, -1)
In : walks = steps.cumsum(axis=1)
In : walks
Out:
array([[  1,   2,   3, ...,  22,  23,  22],
       [  1,   0,  -1, ..., -50, -49, -48],
       [  1,   2,   3, ...,  50,  49,  48],
       ...,
       [ -1,  -2,  -1, ..., -10,  -9, -10],
       [ -1,  -2,  -3, ...,   8,   9,   8],
       [ -1,   0,   1, ...,  -4,  -3,  -2]])
```

Now, we can compute the maximum and minimum values obtained over all of the walks:

```python
In : walks.max()
Out: 114
In : walks.min()
Out: -120
```

Out of these walks, let's compute the minimum crossing time to 30 or -30. This is slightly tricky because not all 5,000 of them reach 30. We can check this using the any method:

```python
In : hits30 = (np.abs(walks) >= 30).any(axis=1)
In : hits30
Out: array([False,  True,  True, ...,  True, False,  True])
In : hits30.sum()  # Number that hit 30 or -30
Out: 3395
```

We can use this Boolean array to select the rows of walks that actually cross the absolute 30 level, and call argmax across axis 1 to get the crossing times:

```python
In : crossing_times = (np.abs(walks[hits30]) >= 30).argmax(axis=1)
In : crossing_times
Out: array([201, 491, 283, ..., 219, 259, 541])
```

Lastly, we compute the average minimum crossing time:

```python
In : crossing_times.mean()
Out: 500.5699558173785
```

Feel free to experiment with other distributions for the steps other than equal-sized coin flips. You need only use a different random generator method, like standard_normal to generate normally distributed steps with some mean and standard deviation:

```python
In : draws = 0.25 * rng.standard_normal((nwalks, nsteps))
```

Note: Keep in mind that this vectorized approach requires creating an array with nwalks * nsteps elements, which may use a large amount of memory for large simulations. If memory is more constrained, then a different approach will be required.

4.8 Conclusion

While much of the rest of the book will focus on building data wrangling skills with pandas, we will continue to work in a similar array-based style. In Appendix A: Advanced NumPy, we will dig deeper into NumPy features to help you further develop your array computing skills.

Copyright 2023, Wes McKinney. All Rights Reserved.