Chapter1_Introduction_to_Machine_Learning.pdf

Full Transcript

Chapter 1
Introduction to Machine Learning

Anaconda & Jupyter Installation
Add Anaconda & Jupyter Installation Link
- Windows
- Mac
- Conda Install Seaborn

Python Packages and Libraries
- Jupyter notebooks: interactive coding and visualization of output
- NumPy, SciPy, Pandas: numerical computation
- Matplotlib, Seaborn: data visualization
- Scikit-learn: machine learning

Introduction to Jupyter Notebook
- Polyglot analysis environment: blends multiple languages
- The name Jupyter refers to Julia, Python, and R
- Supports multiple content types: code, narrative text, images, movies, etc.
Source: http://jupyter.org/

Introduction to Jupyter Notebook
- HTML & Markdown
- LaTeX (equations)
- Code
Source: http://jupyter.org/

Introduction to Jupyter Notebook
- Code is divided into cells to control execution
- Enables interactive development
- Ideal for exploratory analysis and model building

Jupyter Cell Magics
- %matplotlib inline: display plots inline in Jupyter notebook
- %%timeit: time how long a cell takes to execute
- %run filename.ipynb: execute code from another notebook or Python file
- %load filename.py: copy contents of the file and paste into the cell

Jupyter Keyboard Shortcuts
Keyboard shortcuts can be viewed from Help → Keyboard Shortcuts

Making Jupyter Notebooks Reusable
To extract Python code from a Jupyter notebook:
- Convert from command line
    $ jupyter nbconvert --to python notebook.ipynb
- Export from within notebook

Introduction to Pandas
- Library for computation with tabular data
- Mixed types of data allowed in a single table
- Columns and rows of data can be named
- Advanced data aggregation and statistical functions
Source: http://pandas.pydata.org/

Introduction to Pandas
Basic data structures

    Type      Pandas Name
    Vector    Series (1 dimension)
    Array     DataFrame (2 dimensions)

Pandas Series Creation and Indexing
Use data from step tracking application to create a Pandas Series

Code
    import pandas as pd

    step_data = [3620, 7891, 9761,
                 3907, 4338, 5373]

    step_counts = pd.Series(step_data,
                            name='steps')

    print(step_counts)

Output
    0    3620
    1    7891
    2    9761
    3    3907
    4    4338
    5    5373
    Name: steps, dtype: int64

Pandas Series Creation and Indexing
Add a date range to the Series

Code
    step_counts.index = pd.date_range('20150329',
                                      periods=6)

    print(step_counts)

Output
    2015-03-29    3620
    2015-03-30    7891
    2015-03-31    9761
    2015-04-01    3907
    2015-04-02    4338
    2015-04-03    5373
    Freq: D, Name: steps, dtype: int64

Pandas Series Creation and Indexing
Select data by the index values

Code
    # Just like a dictionary
    print(step_counts['2015-04-01'])

    # Or by index position, like an array
    print(step_counts.iloc[3])

    # Select all of April
    print(step_counts['2015-04'])

Output
    3907

    3907

    2015-04-01    3907
    2015-04-02    4338
    2015-04-03    5373
    Freq: D, Name: steps, dtype: int64
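
The selection styles on the slide above differ in one detail that is easy to miss: label-based slices include their end point, while position-based slices do not. The following is a minimal sketch that rebuilds the step data from the slides and writes each selection with the explicit `.loc` and `.iloc` indexers. Using those indexers is a choice made for this example; the slides use plain brackets.

```python
import pandas as pd

# Step data and date index as used in the slides
step_counts = pd.Series([3620, 7891, 9761, 3907, 4338, 5373],
                        index=pd.date_range('20150329', periods=6),
                        name='steps')

# Label-based selection: a single date returns one value
by_label = step_counts.loc['2015-04-01']

# Position-based selection: the fourth entry is the same day
by_position = step_counts.iloc[3]

# Partial date string: every row in April 2015
april = step_counts.loc['2015-04']

# Label slices include both end points (3 rows here)
label_slice = step_counts.loc['2015-03-30':'2015-04-01']

# Position slices exclude the end position (2 rows here)
position_slice = step_counts.iloc[1:3]

print(by_label, by_position)
print(len(april), len(label_slice), len(position_slice))
```

Spelling out `.loc` or `.iloc` makes it clear to a reader whether a number or string is meant as a label or as a position.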

Pandas Data Types and Imputation
Data types can be viewed and converted

Code
    import numpy as np

    # View the data type
    print(step_counts.dtypes)

    # Convert to a float
    # (the built-in float maps to float64)
    step_counts = step_counts.astype(float)

    # View the data type
    print(step_counts.dtypes)

Output
    int64

    float64

Pandas Data Types and Imputation
Invalid data points can be easily filled with values

Code
    # Create invalid data
    step_counts[1:3] = np.nan

    # Now fill it in with zeros
    step_counts = step_counts.fillna(0.)
    # equivalently,
    # step_counts.fillna(0., inplace=True)

    print(step_counts[1:3])

Output
    2015-03-30    0.0
    2015-03-31    0.0
    Freq: D, Name: steps, dtype: float64
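
Whether zero is a sensible fill value depends on what a missing reading means; for step counts it would record two days with no walking at all. As a hedged illustration (not part of the original slides), the sketch below shows two other common choices on the same data: filling with the mean of the observed values, and linear interpolation between neighbouring days.

```python
import numpy as np
import pandas as pd

# Step data from the slides, stored as floats so NaN can be assigned
step_counts = pd.Series([3620, 7891, 9761, 3907, 4338, 5373],
                        index=pd.date_range('20150329', periods=6),
                        name='steps', dtype=float)

# Mark the second and third days as missing, as in the slides
step_counts.iloc[1:3] = np.nan

# Count the missing entries before filling
n_missing = int(step_counts.isna().sum())

# Option 1: fill with the mean of the observed values
mean_filled = step_counts.fillna(step_counts.mean())

# Option 2: linear interpolation between neighbouring observations
interpolated = step_counts.interpolate()

print(n_missing)
print(mean_filled.iloc[1:3])
print(interpolated.iloc[1:3])
```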

Pandas DataFrame Creation and Methods
DataFrames can be created from lists, dictionaries, and Pandas Series

Code
    # Cycling distance
    cycling_data = [10.7, 0, None, 2.4, 15.3,
                    10.9, 0, None]

    # Create a list of tuples; zip stops at the
    # shorter list, so six pairs are produced
    joined_data = list(zip(step_data,
                           cycling_data))

    # The dataframe
    activity_df = pd.DataFrame(joined_data)

    print(activity_df)

Pandas DataFrame Creation and Methods
Labeled columns and an index can be added

Code
    # Add column names to dataframe
    activity_df = pd.DataFrame(joined_data,
                               index=pd.date_range('20150329',
                                                   periods=6),
                               columns=['Walking','Cycling'])

    print(activity_df)

Indexing DataFrame Rows
DataFrame rows can be selected with the 'loc' and 'iloc' indexers

Code
    # Select row of data by index name
    print(activity_df.loc['2015-04-01'])

Output
    Walking    3907.0
    Cycling       2.4
    Name: 2015-04-01, dtype: float64
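
The DataFrame creation slide mentions dictionaries but only demonstrates a list of tuples. A minimal sketch of the dictionary route, assuming the same step and cycling values (the cycling list is cut to six entries here so that both columns have equal length):

```python
import pandas as pd

step_data = [3620, 7891, 9761, 3907, 4338, 5373]
# Only the first six cycling values pair up with the six step counts
cycling_data = [10.7, 0, None, 2.4, 15.3, 10.9]

# Dictionary keys become column names; None is stored as NaN
activity_df = pd.DataFrame(
    {'Walking': step_data, 'Cycling': cycling_data},
    index=pd.date_range('20150329', periods=6))

print(activity_df)

# A row label and a column name together select a single value
print(activity_df.loc['2015-04-01', 'Cycling'])
```

Unlike `zip`, the dictionary form raises an error when the lists differ in length, which can catch mismatched inputs early.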

Indexing DataFrame Rows
DataFrame rows can be selected with the 'loc' and 'iloc' indexers

Code
    # Select row of data by integer position
    print(activity_df.iloc[-3])

Output
    Walking    3907.0
    Cycling       2.4
    Name: 2015-04-01, dtype: float64

Indexing DataFrame Columns
DataFrame columns can be indexed by name

Code
    # Name of column
    print(activity_df['Walking'])

Output
    2015-03-29    3620
    2015-03-30    7891
    2015-03-31    9761
    2015-04-01    3907
    2015-04-02    4338
    2015-04-03    5373
    Freq: D, Name: Walking, dtype: int64

Indexing DataFrame Columns
DataFrame columns can also be indexed as properties

Code
    # Object-oriented approach
    print(activity_df.Walking)

Output
    2015-03-29    3620
    2015-03-30    7891
    2015-03-31    9761
    2015-04-01    3907
    2015-04-02    4338
    2015-04-03    5373
    Freq: D, Name: Walking, dtype: int64

Indexing DataFrame Columns
DataFrame columns can be indexed by integer

Code
    # First column
    print(activity_df.iloc[:,0])

Output
    2015-03-29    3620
    2015-03-30    7891
    2015-03-31    9761
    2015-04-01    3907
    2015-04-02    4338
    2015-04-03    5373
    Freq: D, Name: Walking, dtype: int64
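
Property-style column access is convenient but does not work for every column name. The sketch below uses a small hypothetical table (the column names 'bike km' and 'count' are invented for illustration) to show two cases where bracket access is needed instead.

```python
import pandas as pd

# Hypothetical table: one column name contains a space,
# another shares its name with the DataFrame.count method
df = pd.DataFrame({'Walking': [3620, 7891],
                   'bike km': [10.7, 0.0],
                   'count': [1, 2]})

# Attribute access works for simple names
walking = df.Walking

# 'bike km' is not a valid Python identifier, so brackets are needed
bike = df['bike km']

# df.count is the built-in method, not the column
count_is_method = callable(df.count)
count_column = df['count']

print(walking.tolist(), bike.tolist())
print(count_is_method, count_column.tolist())
```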

Reading Data with Pandas
CSV and other common filetypes can be read with a single command

Code
    # The location of the data file
    filepath = 'data/Iris_Data/Iris_Data.csv'

    # Import the data
    data = pd.read_csv(filepath)

    # Print a few rows
    print(data.iloc[:5])

Assigning New Data to a DataFrame
Data can be (re-)assigned to a DataFrame column

Code
    # Create a new column that is a product
    # of both measurements
    data['sepal_area'] = data.sepal_length * data.sepal_width

    # Print a few rows and columns
    print(data.iloc[:5, -3:])

Applying a Function to a DataFrame Column
Functions can be applied to columns or rows of a DataFrame or Series

Code
    # The lambda function applies what
    # follows it to each value in the column
    data['abbrev'] = (data.species
                      .apply(lambda x:
                             x.replace('Iris-','')))

    # Note that there are other ways to
    # accomplish the above

    print(data.iloc[:5, -3:])
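
The slide on applying a function notes that there are other ways to strip the 'Iris-' prefix. One of them, sketched here on a hypothetical three-value stand-in for the species column, is the vectorized `.str` accessor, which avoids writing a lambda.

```python
import pandas as pd

# Hypothetical stand-in for the species column of the Iris data
species = pd.Series(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])

# The slide's approach: call a Python function on every value
via_apply = species.apply(lambda x: x.replace('Iris-', ''))

# Vectorized string accessor; regex=False treats 'Iris-' literally
via_str = species.str.replace('Iris-', '', regex=False)

print(via_str.tolist())
print(via_apply.equals(via_str))
```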

Concatenating Two DataFrames
Two DataFrames can be concatenated along either dimension

Code
    # Concatenate the first two and
    # last two rows
    small_data = pd.concat([data.iloc[:2],
                            data.iloc[-2:]])

    print(small_data.iloc[:,-3:])

    # See the 'join' method for
    # SQL style joining of dataframes

Aggregated Statistics with GroupBy
The groupby method calculates aggregated DataFrame statistics

Code
    # Use the size method with a
    # DataFrame to get count
    # For a Series, use the .value_counts
    # method
    group_sizes = (data
                   .groupby('species')
                   .size())

    print(group_sizes)

Output
    species
    Iris-setosa        50
    Iris-versicolor    50
    Iris-virginica     50
    dtype: int64

Performing Statistical Calculations
Pandas contains a variety of statistical methods: mean, median, and mode

Code
    # Mean calculated on a DataFrame
    # (numeric_only=True skips text columns such as species)
    print(data.mean(numeric_only=True))

    # Median calculated on a Series
    print(data.petal_length.median())

    # Mode calculated on a Series
    print(data.petal_length.mode())

Output
    sepal_length    5.843333
    sepal_width     3.054000
    petal_length    3.758667
    petal_width     1.198667
    dtype: float64

    4.35

    0    1.5
    dtype: float64
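
The groupby and mean slides can be combined to get a statistic per group instead of a row count. This is a minimal sketch on a hypothetical three-row miniature of the Iris table; the measurements are illustrative values, not taken from the real data set.

```python
import pandas as pd

# Hypothetical miniature of the Iris table (values are illustrative only)
data = pd.DataFrame({
    'species': ['Iris-setosa', 'Iris-setosa', 'Iris-virginica'],
    'petal_length': [1.0, 3.0, 5.0],
    'petal_width': [0.2, 0.4, 2.0],
})

# Mean of each numeric column within each species
group_means = data.groupby('species').mean(numeric_only=True)

# Several statistics for one column at once
petal_stats = data.groupby('species')['petal_length'].agg(['mean', 'median'])

print(group_means)
print(petal_stats)
```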

Performing Statistical Calculations
Standard deviation, variance, SEM and quantiles can also be calculated

Code
    # Standard dev, variance, and SEM
    print(data.petal_length.std(),
          data.petal_length.var(),
          data.petal_length.sem())

    # As well as quantiles
    print(data.quantile(0, numeric_only=True))

Output
    1.76442041995 3.11317941834 0.144064324021

    sepal_length    4.3
    sepal_width     2.0
    petal_length    1.0
    petal_width     0.1
    Name: 0, dtype: float64
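
The three spread statistics on the slide are related: the variance is the square of the standard deviation, and the SEM is the standard deviation divided by the square root of the number of observations. The slide's own numbers are consistent with this (1.76442 squared is about 3.11318, and 1.76442 divided by the square root of 150 is about 0.14406). A small sketch on a hypothetical five-value Series checks the same relations:

```python
import numpy as np
import pandas as pd

# Hypothetical sample; any numeric Series would do
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

std = values.std()   # sample standard deviation (ddof=1 by default)
var = values.var()   # sample variance
sem = values.sem()   # standard error of the mean

# The variance is the square of the standard deviation
print(np.isclose(var, std ** 2))

# The SEM is the standard deviation divided by sqrt(n)
print(np.isclose(sem, std / np.sqrt(len(values))))

# Quantile 0 is the minimum, 0.5 the median, 1 the maximum
print(values.quantile(0), values.quantile(0.5), values.quantile(1))
```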

Performing Statistical Calculations
Multiple calculations can be presented in a DataFrame

Code
    print(data.describe())

Sampling from DataFrames
Rows can be randomly sampled from a DataFrame

Code
    # Sample 5 rows without replacement
    sample = (data
              .sample(n=5,
                      replace=False,
                      random_state=42))

    print(sample.iloc[:,-3:])

SciPy and NumPy also contain a variety of statistical functions.

Visualization Libraries
Visualizations can be created in multiple ways:
- Matplotlib
- Pandas (via Matplotlib)
- Seaborn
  - Statistically-focused plotting methods
  - Global preferences incorporated by Matplotlib

Basic Scatter Plots with Matplotlib
Scatter plots can be created from Pandas Series

Code
    import matplotlib.pyplot as plt

    plt.plot(data.sepal_length,
             data.sepal_width,
             ls='', marker='o')

Basic Scatter Plots with Matplotlib
Multiple layers of data can also be added

Code
    plt.plot(data.sepal_length,
             data.sepal_width,
             ls='', marker='o',
             label='sepal')

    plt.plot(data.petal_length,
             data.petal_width,
             ls='', marker='o',
             label='petal')
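
The `label` arguments in the layered scatter plot only name each layer; a legend appears once `legend()` is called. The sketch below is one way to do that with the axes-based interface, using a few hypothetical measurements in place of the Iris columns and the non-interactive 'Agg' backend so it also runs outside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical measurements standing in for the Iris columns
data = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 6.3],
    'sepal_width': [3.5, 3.0, 3.3],
    'petal_length': [1.4, 1.4, 6.0],
    'petal_width': [0.2, 0.2, 2.5],
})

fig, ax = plt.subplots()
ax.plot(data.sepal_length, data.sepal_width,
        ls='', marker='o', label='sepal')
ax.plot(data.petal_length, data.petal_width,
        ls='', marker='o', label='petal')

# legend() draws the key from the label= values given above
legend = ax.legend()
ax.set(xlabel='length', ylabel='width')

labels = [text.get_text() for text in legend.get_texts()]
print(labels)
```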

Histograms with Matplotlib
Histograms can be created from Pandas Series

Code
    plt.hist(data.sepal_length, bins=25)

Customizing Matplotlib Plots
Every feature of Matplotlib plots can be customized

Code
    fig, ax = plt.subplots()

    ax.barh(np.arange(10),
            data.sepal_width.iloc[:10])

    # Set position of ticks and tick labels
    ax.set_yticks(np.arange(0.4,10.4,1.0))
    ax.set_yticklabels(np.arange(1,11))
    ax.set(xlabel='xlabel', ylabel='ylabel',
           title='Title')

Incorporating Statistical Calculations
Statistical calculations can be included with Pandas methods

Code
    (data
     .groupby('species')
     .mean()
     .plot(color=['red','blue',
                  'black','green'],
           fontsize=10.0, figsize=(4,4)))

Statistical Plotting with Seaborn
Joint distribution and scatter plots can be created

Code
    import seaborn as sns

    # 'height' replaces the older 'size' argument
    sns.jointplot(x='sepal_length',
                  y='sepal_width',
                  data=data, height=4)

Statistical Plotting with Seaborn
Correlation plots of all variable pairs can also be made with Seaborn

Code
    sns.pairplot(data, hue='species', height=3)

Legal Notices and Disclaimers
This presentation is for informational purposes only. INTEL MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY. Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer or learn more at intel.com. This sample source code is released under the Intel Sample Source Code License Agreement. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. Copyright © 2017, Intel Corporation. All rights reserved.
