Advanced Machine Learning PDF

The University of Manchester
Alliance Manchester Business School
BMAN73701 Programming in Python for Business Analytics
Week 5: Lecture 2
Advanced Machine Learning
Prof. Manuel López-Ibáñez
[email protected]
Office hours: Mon 4pm-5pm, Fri 9am-10am
https://calendly.com/manuel-lopez-ibanez

From Raw Data to Data Analysis
Raw Data Sources (Databases, Web, Excel, Text, APIs) -> Acquisition -> Raw data -> Tidying -> Tabular Data -> Preprocessing -> Tabular Data -> Data Analysis (summary stats, analysis, visualisation)
Column types in tabular data: Numerical, Categorical, Ordinal

Machine Learning with scikit-learn
- Built on top of NumPy and Matplotlib
- Input may be a NumPy array or a Pandas DataFrame; output is a NumPy array
- Open-source, free to use and contribute
- Rapidly improving!
- Object-oriented: create objects, then call their methods to fit (train) them or to transform other data
Book cover shown: Andreas C. Müller & Sarah Guido
Docs: http://scikit-learn.org/stable/documentation.html
Examples: http://scikit-learn.org/stable/auto_examples/index.html
API reference: http://scikit-learn.org/stable/modules/classes.html

A High-level View of ML
Unsupervised ML: “Learn something without knowing any answers”
Supervised ML: “Given these examples of answers, learn to answer”
- Classification: assign categories, labels (discrete)
- Regression: predict a real-valued number (continuous)

Supervised ML
1) Train/test random split: the data (X, y) is split into training data (Xtrain, ytrain) and test data (Xtest, ytest)
2) Train ML model on Xtrain, ytrain
3) Score on the test set: predict ypred from Xtest and compare it with ytest
[Figure: neural network with input layer, hidden layer and output layer]

K-fold Cross-validation
1) Train/test random split: the data (X, y) is split into training data (Xtrain, ytrain) and test data (Xtest, ytest)
2) K-fold split of the training data: in each round, K-1 folds form the training set (Xtrain_1, ytrain_1 in round 1) and the remaining fold forms the validation set (Xval_1, yval_1)
3) Train ML model for each possible K-fold split and compute score on validation data (ypred_1 against yval_1)
[Figure: Rounds 1, 2, 3, ..., 10, each holding out a different fold as the validation set]

K-fold Cross-validation
- If K is too small: faster, but poor generalization; too few rounds, training fold may be too small
- If K is too large: takes more time, more variance, validation fold may be too small

K-fold Cross-validation
[Figure: Class 0 and Class 1 samples across Rounds 1 to 5]
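
The split-then-cross-validate procedure above can be sketched roughly as follows. This is a minimal illustration, not the lecture's own code: the synthetic dataset, the decision tree model, K = 5 and all variable names are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; the slides do not name a dataset here).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# 1) Train/test random split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 2) K-fold split of the training data only (K = 5 is an arbitrary choice).
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# 3) Train one model per round and score it on that round's validation fold.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
val_scores = cross_val_score(model, X_train, y_train, cv=kfold)
print("validation accuracy per round:", val_scores)
print("mean validation accuracy:", val_scores.mean())

# Final check on the held-out test set, which took no part in the K rounds.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print("test accuracy:", test_accuracy)
```

Passing an explicit KFold object makes the number of rounds and the shuffling visible; the test set is touched only once, after the cross-validation rounds.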

Stratified K-fold Cross-validation
[Figure: distribution of Class 0 and Class 1 across Rounds 1 to 5, plain K-fold compared with stratified K-fold]

Stratified K-fold Cross-validation
For classification problems, keep the same proportion of class labels:
- Within train and test sets
- Within each fold
When classes are unbalanced, use stratification:
- train_test_split(stratify=y_labels)
- cross_val_score() uses stratified k-fold CV automatically for classifiers when cv is an integer or left at its default

Supervised: fit, predict, score
model.fit(x_train, y_train)  # Supervised
Build the model:
- Decision tree: learn decision points/branches
- Neural Network (MLP): learn weights of neurons
model.predict(x_test)
Create new output:
- Classifiers: predict label for each input
- Regression: predict numerical output for each input
model.score(x_test, y_test)
Predict and compare prediction with given output:
- Classifiers: accuracy, …
- Regression: R², …
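
The stratification advice and the fit, predict, score pattern above can be tried out together in a short sketch. The imbalanced synthetic data, the decision tree and the variable names are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Imbalanced synthetic data, roughly 80% class 0 and 20% class 1 (an assumption).
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.8, 0.2], random_state=0
)

# stratify=y keeps the class proportions equal in the train and test sets.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
print("share of class 1 (all, train, test):", y.mean(), y_train.mean(), y_test.mean())

# fit: build the model from the training data.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(x_train, y_train)

# predict: one label per test input.
y_pred = model.predict(x_test)

# score: predict and compare with the given output (accuracy for classifiers).
accuracy = model.score(x_test, y_test)
manual_accuracy = np.mean(y_pred == y_test)
print("accuracy:", accuracy)

# Explicit stratified K-fold on the training data.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, x_train, y_train, cv=cv)
print("stratified CV accuracy per round:", cv_scores)
```

Comparing `accuracy` with `manual_accuracy` shows that `score()` on a classifier is simply the fraction of correct predictions.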

Today
- Random Forests
- Confusion matrices and other scoring metrics
- Hyper-parameter optimisation
- Machine Learning Pipelines

Part 1: Random forests and parameter importance
Part 2: Confusion matrices and scoring metrics
Part 3: Hyper-parameter optimisation
Part 4: Pipelines

Random Forests
Ensemble of decision trees
- Random: random decisions when building the trees
- Forest: many trees
[Figure: input X is passed to Tree 1, Tree 2, ..., Tree N; each tree predicts a class (A, B or C) and the predictions are combined by vote or average]
- Trees make splits based on different features
- Split criterion: information gain for classification, variance reduction for regression
- Features whose splits result in improvement are more important
- Combining the trees: majority vote for classification problems, average for regression
✓ Avoid overfitting
✓ More powerful
✓ Measures feature importance

Default in Credit Card Payments
[Notebook output garbled in extraction. Recoverable fragments: a count of 6636 for class 1 in the column named default (dtype int64), and the annotation "stratify later".]
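
The random forest ideas above (many trees, combined predictions, feature importance) can be sketched with scikit-learn's RandomForestClassifier. This is only an illustration under stated assumptions: the data is synthetic and merely imitates an imbalanced default/no-default target, and the feature names are hypothetical placeholders, not columns of the credit card dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (an assumption, not the lecture's dataset).
X, y = make_classification(
    n_samples=1000,
    n_features=6,
    n_informative=3,
    n_redundant=0,
    weights=[0.78, 0.22],
    random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

# Stratified split, since the classes are unbalanced.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Forest: many trees (100 here); Random: each tree sees a bootstrap sample
# and considers a random subset of features at each split.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))

# Impurity-based importances: one non-negative value per feature.
importances = forest.feature_importances_
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

The importances are relative: they show which features contributed most to the splits in this particular forest, not the direction or size of their effect on the prediction.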

Precision: $\text{Precision} = \frac{TP}{TP + FP}$
- Avoid false positives (make sure that we predict the positive class correctly)
- Example: we don't want important email flagged as spam
Recall: $\text{Recall} = \frac{TP}{TP + FN}$
- Avoid false negatives (make sure we predict all true positives)
- Use when FN are costly (e.g. detecting diseases): we don't want people who have COVID to get a negative test result
F1 score: $\text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
- Consider both Precision and Recall
- Use when FN and FP are both important (better than accuracy for imbalanced data)

Classification metrics
y_true = [0,1,0,1,1,0,1,1,1,1]
y_pred = [0,1,1,0,1,1,0,0,1,1]
Confusion matrix?
                Predicted 0    Predicted 1
True class 0    TN = 1         FP = 2
True class 1    FN = 3         TP = 4
Accuracy: $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
How many are classified correctly? Accuracy: 5/10 = 0.5
Precision: $\text{Precision} = \frac{TP}{TP + FP}$
What ratio of predicted positives are actual positives? Precision: 4/6 = 0.67
Recall: $\text{Recall} = \frac{TP}{TP + FN}$
What ratio of the actual positives was identified by the model? Recall: 4/7 = 0.57
F1 score: $\text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
F1 score? 0.62

Confusion matrix of Credit Default
forest.score(X_test, y_test) returns 0.7873333333333333 (the default score is accuracy)
Confusion matrix layout: columns are the predicted class (0, 1); the row for true class 0 holds TN and FP
[Remaining notebook cells garbled in extraction.]
value_counts() output for the target column:
0    23364
1     6636
Name: default, dtype: int64
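
The worked example on the Classification metrics slide can be checked with scikit-learn's metric functions. The y_true and y_pred lists are taken from the slide; the choice of functions and the variable names are just one reasonable way to do the check.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Labels from the slide.
y_true = [0, 1, 0, 1, 1, 0, 1, 1, 1, 1]
y_pred = [0, 1, 1, 0, 1, 1, 0, 0, 1, 1]

# Rows are the true class, columns are the predicted class:
# [[TN FP]
#  [FN TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)  # [[1 2]
           #  [3 4]]

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all = 5/10
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 4/6
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 4/7
f1 = f1_score(y_true, y_pred)                # 2PR / (P + R) = 8/13

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.50 precision=0.67 recall=0.57 f1=0.62
```

With an accuracy of only 0.5, the example also shows why precision, recall and F1 are worth reporting separately: the model misses three of the seven actual positives, which accuracy alone does not reveal.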
