
Deep Learning and Variants_Session 3 _20240121.pdf


Full Transcript


Presents: Deep Learning & its variants (GGU DBA)
Training Neural Networks
Dr. Anand Jayaraman, Professor, upGrad; Chief Data Scientist, Agastya Data Solutions

How are the weights obtained? How is the MLP trained?

Learning is changing weights. In the very simple cases:
– Start random.
– If the output is correct, do nothing.
– If the output is too high, decrease the weights attached to high inputs.
– If the output is too low, increase the weights attached to high inputs.

How to train a network?
Let (x_i, y_i) be input-label pairs from the dataset. Given an input x_i, the network output is ŷ_i = f(x_i; W), where W is the set of network parameters. The cost function is defined as the sum of squared errors (SSE): E(W) = Σ_i (y_i − ŷ_i)². Our task is to minimize E by updating W. How can we do that?

Finding the minima: Gradient Descent
Given a function E with parameters W, E can be minimized by moving in the direction opposite to the gradient of E with respect to W. In practice, we scale the gradient down by a parameter α, called the learning rate. The learning rate is between 0 and 1.

GD and local minima

ANN learning
The same "change the weights" intuition from above, written as an update rule:
w_{t+1} = w_t − α · ∂E/∂W

ANN Learning: Back-propagation
The method of computing the sensitivity of the error to a change in the weights, ∂E/∂W, is called back-propagation. The term is an abbreviation for "backward propagation of errors". It was popularized by a paper by Geoffrey Hinton, which led to a renaissance in the area of neural networks.

Multiple methods of updating weights
– Full batch: regular training as done for all other ML algorithms. The entire input is used to train, and the weights are updated using gradient descent.
– Online: show one input row at a time and adjust the weights, show another and adjust again; once the input is exhausted, start over with row 1 if needed.
– Mini-batch: pick a small random sample of inputs and perform a batch update, then pick another random sample, and so on.

More on mini-batch gradient descent
Say we have 500K samples and each mini-batch contains 512 samples, so we make roughly 1000 mini-batches. We construct a matrix of 512 samples, forward-propagate to make 512 predictions, compute the average cost over those 512 samples, and then update the weights once. An epoch is one pass over all ~1000 mini-batches, and we run multiple epochs before convergence. If the mini-batch size is 1, this is stochastic gradient descent; if it is 500K, it is full-batch gradient descent. For fewer than 2000 samples, go for full batch. For large data sets, a mini-batch of 64 to 512 (a power of 2) works well. (A code sketch of this training loop follows the next slide.)

NN Training: Local versus global
Gradient descent is greedy. Running multiple initializations and selecting the best is the usual option; there is no way to theoretically guarantee convergence to the global optimum. But training generally converges to a local optimum that is very close to the global one, and therefore still delivers state-of-the-art results. :)
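To make the update rule and the mini-batch procedure concrete, here is a minimal NumPy sketch (not from the slides): a tiny one-hidden-layer MLP trained with back-propagation and mini-batch gradient descent on synthetic data. The layer sizes, sigmoid activations, learning rate, and dataset are illustrative assumptions.

```python
# Minimal sketch: mini-batch gradient descent with back-propagation
# for a 1-hidden-layer MLP, using NumPy and synthetic data (assumptions).
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset: 10,000 samples, 4 features, binary labels (assumption).
X = rng.normal(size=(10_000, 4))
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(float).reshape(-1, 1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Network parameters W: 4 -> 8 -> 1 (sizes chosen only for illustration).
W1 = rng.normal(scale=0.1, size=(4, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.1, size=(8, 1)); b2 = np.zeros(1)

alpha = 0.1          # learning rate, between 0 and 1
batch_size = 512     # mini-batch size (a power of 2, as suggested)
n_epochs = 20

for epoch in range(n_epochs):
    idx = rng.permutation(len(X))                  # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        batch_idx = idx[start:start + batch_size]
        xb, yb = X[batch_idx], y[batch_idx]

        # Forward pass: predictions for the whole mini-batch at once.
        h = sigmoid(xb @ W1 + b1)
        y_hat = sigmoid(h @ W2 + b2)

        # Back-propagation: gradients of the mean SSE over this mini-batch,
        # E = mean((y_hat - y)^2), with respect to every weight.
        d_out = 2 * (y_hat - yb) * y_hat * (1 - y_hat) / len(batch_idx)
        dW2 = h.T @ d_out
        db2 = d_out.sum(axis=0)
        d_hid = (d_out @ W2.T) * h * (1 - h)
        dW1 = xb.T @ d_hid
        db1 = d_hid.sum(axis=0)

        # Gradient-descent update: w_{t+1} = w_t - alpha * dE/dw
        W1 -= alpha * dW1; b1 -= alpha * db1
        W2 -= alpha * dW2; b2 -= alpha * db2

    cost = np.mean((sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) - y) ** 2)
    print(f"epoch {epoch + 1}: mean squared error = {cost:.4f}")
```

Setting batch_size to 1 here would give stochastic gradient descent, and setting it to len(X) would give full-batch gradient descent, matching the remark in the slides.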
Inventory Management: Predicting Backorders
The dataset has 21 input features (plus the sku ID and the target):
– sku: random ID for the product
– national_inv: current inventory level for the part
– lead_time: transit time for the product (if available)
– in_transit_qty: amount of product in transit from source
– forecast_3_month: forecast sales for the next 3 months
– forecast_6_month: forecast sales for the next 6 months
– forecast_9_month: forecast sales for the next 9 months
– sales_1_month: sales quantity for the prior 1 month period
– sales_3_month: sales quantity for the prior 3 month period
– sales_6_month: sales quantity for the prior 6 month period
– sales_9_month: sales quantity for the prior 9 month period
– min_bank: minimum recommended amount to stock
– potential_issue: source issue for part identified
– pieces_past_due: parts overdue from source
– perf_6_month_avg: source performance for the prior 6 month period
– perf_12_month_avg: source performance for the prior 12 month period
– local_bo_qty: amount of stock orders overdue
– deck_risk: part risk flag
– oe_constraint: part risk flag
– ppap_risk: part risk flag
– stop_auto_buy: part risk flag
– rev_stop: part risk flag
– went_on_backorder: whether the product actually went on backorder (the target value)

Network topology for this problem: an input layer with one node per input feature, a first hidden layer with 10 neurons, a second hidden layer with 18 neurons, and an output layer with 1 output. (A code sketch of this topology appears after the comparison table below.)

Practical Guidelines
How many hidden layers do we need? Most structured-data problems can be handled with at most 2 hidden layers; problems where more layers can be justified are rare.
How many neurons/nodes do I use? The previously mentioned "rules of thumb" are merely that; they are not absolute rules. Experimentation is needed to figure out the number of neurons. The number of neurons is a hyper-parameter and, like all hyper-parameters, is best determined through cross-validation (see the second code sketch below). See ftp://ftp.sas.com/pub/neural/FAQ3.html#A_hu

Network Topology
– 90% of problems: one hidden layer
– 10% of problems: two hidden layers
– Rarely more than that; deeper networks may lead to overfitting.

Topology: Example
– 2x2 ANN: training accuracy 97.36%, test accuracy 89.38%
– 4x3 ANN: training accuracy 100%, test accuracy 86.24%
The larger network fits the training data perfectly but generalizes worse, illustrating overfitting.

Comparison to other ML methods

| Criterion | Neural Networks | Decision Trees | SVM | Regression models |
| Structured data | Yes | Yes | Yes | Yes |
| Unstructured data (images, text, audio) | Yes | No | No | No |
| Feature extractor + classifier | Yes | Classifier only | Classifier only | Classifier only |
| Number of training samples required | Large | Medium-low | Medium-low | Medium |
| Hyper-parameter tuning | Heavy | Low | Low | Low |
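As referenced above, here is one possible sketch of the backorder topology (21 inputs, hidden layers of 10 and 18 neurons, a single output) using Keras. The slides do not specify activations, loss, or optimizer, so ReLU/sigmoid, binary cross-entropy, and Adam are assumptions, as is the availability of already cleaned, encoded, and scaled arrays X (n_samples x 21) and a 0/1 target y for went_on_backorder.

```python
# Minimal sketch of the 21 -> 10 -> 18 -> 1 topology, assuming TensorFlow/Keras.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(21,)),              # one node per input feature
    tf.keras.layers.Dense(10, activation="relu"),    # first hidden layer: 10 neurons
    tf.keras.layers.Dense(18, activation="relu"),    # second hidden layer: 18 neurons
    tf.keras.layers.Dense(1, activation="sigmoid"),  # output layer: 1 output (backorder probability)
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X, y, epochs=20, batch_size=512, validation_split=0.2)
```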
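And, as one illustration of treating the number of neurons as a hyper-parameter chosen by cross-validation (the slides do not prescribe a tool), a scikit-learn grid search over candidate hidden-layer sizes might look like the sketch below; the candidate topologies listed are arbitrary.

```python
# Minimal sketch: pick the hidden-layer sizes by 5-fold cross-validation,
# assuming scikit-learn and the same X, y arrays as above.
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

param_grid = {
    # Candidate topologies: one or two hidden layers, various widths.
    "hidden_layer_sizes": [(5,), (10,), (20,), (10, 18), (20, 10)],
}

search = GridSearchCV(
    MLPClassifier(max_iter=500, random_state=0),
    param_grid,
    cv=5,                 # 5-fold cross-validation
    scoring="accuracy",
)
# search.fit(X, y)
# print(search.best_params_, search.best_score_)
```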
