Document Details

Uploaded by NiftyEuropium

Bentley University

Mengchuan (Mike) Fu

Tags

artificial intelligence finance machine learning data analysis

Summary

This Bentley University presentation on AI in Finance covers the use of artificial intelligence in the financial industry. It describes the advantages of AI in finance, including speed, accuracy, and consistency, and applications such as algorithmic trading, credit risk modeling, and fraud detection. It also touches on machine learning concepts, chatbots, and hands-on exercises with Anaconda and Orange 3.

Full Transcript

AI in Finance - Day 1
Mengchuan (Mike) Fu

AI in Finance
AI in finance refers to the use of artificial intelligence techniques and technologies to analyze financial data, make predictions, automate processes, and improve decision-making within the financial industry.

Artificial intelligence vs. human intelligence
Artificial intelligence (AI) is intelligence demonstrated by machines that mimic “cognitive” functions humans associate with the human mind, such as “learning” and “problem solving”.
It is about building machines that can mimic human behavior, learn from previous experience, and perform human-like tasks.

Advantages of AI in Finance
Speed and Efficiency: AI algorithms can process vast amounts of financial data at high speeds, enabling rapid analysis and decision-making.
Accuracy and Consistency: AI systems can analyze data with precision and consistency, reducing the likelihood of errors or biases that may arise from human judgment.
Data Analysis and Pattern Recognition: AI algorithms excel at identifying complex patterns and relationships within financial data.

Advantages of AI in Finance (continued)
Automation of Repetitive Tasks: AI technologies can automate repetitive and time-consuming tasks, freeing people to focus on more strategic activities that require creativity and critical thinking.
Scalability: AI solutions can scale to handle large volumes of data and adapt to changing business requirements.
24/7 Availability: AI-driven systems can operate continuously without breaks, allowing for round-the-clock monitoring and decision-making in global financial markets.

Alan Turing
Alan Mathison Turing (23 June 1912 – 7 June 1954) was an English mathematician, computer scientist, logician, cryptanalyst, philosopher, and theoretical biologist. He is widely regarded as the father of theoretical computer science and artificial intelligence.
Turing Test
The Turing Test is a method of inquiry in artificial intelligence (AI) for determining whether or not a computer is capable of thinking like a human being.
During the test, one human functions as the questioner, while a second human and the computer function as respondents.
The questioner interrogates the respondents within a specific subject area, using a specified format and context.
After a preset length of time or number of questions, the questioner is asked to decide which respondent was the human and which was the computer.

ChatGPT
Created by San Francisco-based tech lab OpenAI, ChatGPT is a generative AI software application that uses a machine learning technique called reinforcement learning from human feedback (RLHF) to emulate human-written conversations based on a large range of user prompts.
ChatGPT learns language by training on texts gleaned from across the internet, including online encyclopedias, books, academic journals, and blogs.
Based on this training, the AI chatbot generates text by making predictions about which words (or tokens) can be strung together to produce the most suitable response.

How is ChatGPT trained
Generative pretraining: train a raw language model on text data.
Supervised fine-tuning (SFT): the model is further trained to mimic ideal chatbot behavior demonstrated by humans.
Reinforcement learning from human feedback (RLHF): human preferences over alternative model outputs are used to define a reward function for additional training with reinforcement learning (RL).

Generative pretraining
Autoregressive sequence model: $P(X_{t+1} \mid X_1, \dots, X_t)$
During training, sequences are extracted from a data set and the model’s parameters are adjusted to maximize the probability assigned to the true $X_{t+1}$ conditioned on the history $X_1, \dots, X_t$.
A language model using word tokens outputs a probability distribution over all the words in its vocabulary, indicating how likely each word is to come next.
To generate text, a next token is selected from this distribution and appended to the history; this process repeats until a special stop token is selected.
Example: history $(X_1, \dots, X_t)$ = “AI in”, next element $X_{t+1}$ = ?
$P(? = \text{education} \mid X_1, \dots, X_t) = 0.2$
$P(? = \text{gaming} \mid X_1, \dots, X_t) = 0.1$
$P(? = \text{healthcare} \mid X_1, \dots, X_t) = 0.09$
$P(? = \text{finance} \mid X_1, \dots, X_t) = 0.08$

Supervised fine-tuning (SFT)
Humans play both sides of the chat.
Each training example consists of a particular conversation history paired with the next response of the human acting as the chatbot.
Given a particular history, the objective is to maximize the probability the model assigns to the sequence of tokens in the corresponding response.
[Figure: a conversation history followed by the next response]

Reinforcement learning from human feedback
Establish a reward function based on human preferences.
AI trainers first have conversations with the current model.
For any given model response, a set of alternative responses is also sampled.
Human labelers rank them from most to least preferred, for example C > D > B > A for four responses A, B, C, and D.

Machine learning
Machine learning is a subset of artificial intelligence (AI) that focuses on the development of algorithms and models that enable computers to learn from data and make predictions or decisions based on it.
In essence, machine learning allows computers to recognize patterns and relationships within data without being explicitly programmed to do so.
Types of machine learning: supervised learning, unsupervised learning, and reinforcement learning.

Supervised vs. Unsupervised Learning
Supervised learning algorithms work with labeled data.
Unsupervised learning algorithms work with unlabeled data.
Semi-supervised learning algorithms work with a combination of labeled and unlabeled data: they extrapolate what they learn from the labeled data to the unlabeled data and draw conclusions from the set as a whole.

Reinforcement learning
Unlike supervised and unsupervised learning, reinforcement learning is driven by feedback.
It is a learning methodology in which the algorithm receives feedback in the form of rewards, learns from it, and improves its future results.
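The generative pretraining slide above describes estimating $P(X_{t+1} \mid X_1, \dots, X_t)$ from data and then generating tokens until a stop token is selected. The following is a minimal sketch of that loop, not how ChatGPT itself is implemented: it assumes a toy four-sentence corpus, a hypothetical `<stop>` token, and a bigram simplification in which the history is truncated to the most recent token. Counting next-token frequencies is the maximum-likelihood fit for this simplified model.

```python
import random
from collections import Counter, defaultdict

STOP = "<stop>"  # hypothetical stop token, chosen for this illustration


def train_bigram(sentences):
    """Estimate P(next token | previous token) by counting (maximum likelihood)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split() + [STOP]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    # Normalize the counts for each previous token into a probability distribution.
    return {
        prev: {tok: c / sum(ctr.values()) for tok, c in ctr.items()}
        for prev, ctr in counts.items()
    }


def generate(model, start, rng, max_tokens=20):
    """Sample one token at a time until the stop token is selected."""
    tokens = [start]
    while len(tokens) < max_tokens:
        dist = model[tokens[-1]]  # distribution over the next token
        nxt = rng.choices(list(dist), weights=list(dist.values()))[0]
        if nxt == STOP:
            break
        tokens.append(nxt)
    return tokens


# Toy corpus (an assumption for illustration only).
corpus = ["AI in finance", "AI in education", "AI in education", "AI in gaming"]
model = train_bigram(corpus)
print(model["in"])  # next-token distribution after "in"
print(generate(model, "AI", random.Random(0)))
```

A real language model conditions on the full history with a neural network rather than a lookup table, but the train-then-sample structure is the same.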
AI in finance
Algorithmic trading
Credit risk modeling / Fraud detection
Financial forecasts, planning and analysis
Customer service automation
User behavior analysis

Algorithmic trading
Algorithmic trading involves using computer code to automatically enter and exit trades once certain criteria are met. It can:
Eliminate the effect of emotions
Increase activity and improve execution speed
Reduce the time spent on analysis
Enable back-testing of strategies

Algorithmic trading - Supervised learning
Extract signals from the market: risk factors, return forecasts, volatility forecasts, statistical arbitrage / long-short signals (mean reversion analyses)
Predictive models:
Regularized linear models: elastic net (Enet) and other variants
Ensemble models: random forests, gradient boosting, etc.
Bayesian ML: quantify uncertainty about future events

Algorithmic trading - Unsupervised learning
Dimensionality reduction and clustering: risk factors, portfolio creation
Natural Language Processing: extracting insights and recognizing patterns from unstructured data (financial text data, financial news, earnings call transcripts, and alternative data sources)
NLP techniques: sentiment analysis, topic modeling, word embedding

Algorithmic trading - Deep Learning
Learn and synthesize increasingly complex patterns
Deep Convolutional Neural Networks (CNN)
Recurrent Neural Networks (RNN)
Long Short-Term Memory (LSTM)
Gated Recurrent Units (GRU)
Autoencoders / Convolutional Autoencoders: reproduce the input
Use cases: multivariate time series data, satellite images, nonlinear feature extraction

Algorithmic trading - Reinforcement Learning
Algorithms:
Q-learning
Advantage Actor-Critic (A2C) and its asynchronous variant (A3C)
Proximal Policy Optimization (PPO)
Trust Region Policy Optimization (TRPO)
Deep Deterministic Policy Gradient (DDPG)
Use cases: portfolio management, order placement / Q-Trading

Credit risk modeling / Fraud detection
Consumer: credit scoring, credit risk forecasting, credit card fraud detection
Corporate: corporate credit assessment, bankruptcy, mortgage risk, peer-to-peer lending

Financial forecasts, planning and analysis (FP&A)
Sales, marketing, service: sales forecasts, market analysis, customer propensity
Finance: revenue forecasts, expense forecasts
Human resources: head count requirements, salary forecasts, training costs
Sourcing and supply chain: production cost forecasts, shipping cost forecasts
External: weather, economic indicators, demographic changes
Predictive and prescriptive data analysis of past, present, and future performance
Warning signs
Real-time monitoring for compliance
Understanding the business’s key metrics

Please introduce yourself
Personal Background and Education: Briefly introduce yourself, mentioning your name and any relevant academic background.
Relevant Skills and Expertise: Highlight your technical skills in AI, such as programming languages (Python, R), machine learning algorithms, data analysis, and neural networks. Mention your expertise in finance, including knowledge of financial markets, investment strategies, risk management, and financial modeling.
Passion for AI and Finance: Share how you became interested in AI and finance. Discuss any pivotal moments or influences that sparked your passion for these fields.

Anaconda
Anaconda is a convenient tool for anyone interested in data science, coding, and scientific research. It is like a big toolbox filled with tools for coding, data analysis, and scientific research.
By providing a wide range of pre-installed libraries and a user-friendly interface, it makes it easier for beginners to start learning and working on their projects without getting bogged down by complicated setup processes.
It simplifies package management and deployment.
It is an open-source distribution of the Python and R programming languages.
It is specially designed for people who work with data and want to use programming languages like Python and R.

Why Use Anaconda?
Ease of Use: It makes setting up and managing your programming environment much easier, especially if you are new to coding.
All-in-One Package: You get the tools you need in one place, without having to install each one separately.
Popular in Data Science: Anaconda is widely used in data science and machine learning, so learning to use it can give you a head start if you are interested in these areas.

Orange 3
Orange 3 is a versatile open-source data visualization and analysis tool, designed for visual programming and machine learning.
It is available as an application in Anaconda Navigator.
By using Orange 3 within Anaconda Navigator, users can rely on the Anaconda ecosystem for package management and environment handling, which makes it easier to manage dependencies and work on different projects.

Key features of Orange 3
Visual Programming Interface: Create data analysis workflows visually by dragging and dropping widgets, which makes the tool accessible to those who are not comfortable with coding.
Machine Learning: Supports a wide range of machine learning algorithms and provides tools for tasks such as classification, regression, and clustering. Users can build and evaluate predictive models.
Data Visualization: Offers extensive visualization options, enabling users to explore data through scatter plots, box plots, histograms, heatmaps, and other graphical representations.
Integration with Python: While it is user-friendly for non-programmers, Orange 3 also integrates with Python, allowing more advanced users to extend its functionality through scripting.

Access Orange 3 in Anaconda Navigator
Open Anaconda Navigator: launch the Anaconda Navigator application from your computer.
Launch Orange 3: click the "Launch" button next to Orange 3 to start the application.
Widgets
The core of Orange 3 consists of widgets.
Widgets are modular units that perform specific tasks within the data analysis workflow.
Each widget represents a specific functionality.
Users can combine these widgets into a workflow to perform complex data analyses.

Read data
Four data-loading widgets are shown:
Reads data from Excel (.xlsx), simple tab-delimited (.txt), or comma-separated (.csv) files, or from URLs.
Reads comma-separated files and sends the dataset to its output channel. File separators can be commas, semicolons, spaces, tabs, or manually defined delimiters.
Loads a dataset from an online repository: retrieves the selected dataset from the server and sends it to the output.
Reads data from an SQL database. It can connect to PostgreSQL or SQL Server.

Task 1
Load the Banking Crises data.
Click “Datasets” to add a Datasets widget to your canvas.
Double-click the newly added widget (“Datasets”) and select “Banking crises”.

Widgets communication
Widgets communicate with one another. Each widget has an input channel, an output channel, or both.
Ways to add widgets to the workflow:
Click on the widget in the widget panel.
Click and drag the widget onto the canvas to place it exactly where you want it.
Right-click on the canvas and the widget menu will appear.
Drag a communication channel from the output of another widget.

Task 2
Add a few widgets to the workflow and interact with the loaded dataset.
1. Data table (Data)
2. Data info (Data)
3. Distributions (Visualize)
4. Scatter plot (Visualize)

Feature types
Feature types (also known as variable types or attribute types) play a crucial role in determining how data is processed, analyzed, and visualized.
Categorical (Discrete): represent distinct categories or classes. Each value is a label from a finite set of possible categories. Examples: Gender (Male, Female), Species (Setosa, Versicolor, Virginica). Usage: often used in classification problems, where the goal is to predict a category.
Numerical (Continuous): represent continuous values and can take any numerical value within a range. Examples: Height, Weight, Temperature. Usage: commonly used in regression problems and various statistical analyses.

Feature types (continued)
Time: represent time-related data, in formats such as dates or timestamps. Examples: Date of Birth, Timestamp of an event. Usage: time series analysis, forecasting, and temporal data exploration.
Text: consist of textual data, which can be strings or larger blocks of text. Examples: Product Reviews, Customer Feedback. Usage: natural language processing (NLP) tasks, text mining, and sentiment analysis.
Meta: hold additional information about the dataset or auxiliary data. They do not directly participate in the analysis. Examples: identifiers, comments, or other descriptive information. Usage: annotation, identification, or descriptive purposes.

Supervised learning
Supervised learning is a key concept in machine learning and AI, where an algorithm learns from labeled training data to make predictions or decisions without being explicitly programmed to perform the task.
Supervised learning involves training a model on a labeled dataset, which means that each training example is paired with an output label.
The goal is for the model to learn a mapping from inputs to outputs that can then be applied to new, unseen data.

[Figure: supervised learning]

Types of Supervised Learning
Regression: used for predicting continuous values.
Predicting stock prices: predict future prices based on historical prices and other relevant financial indicators.
Portfolio management: estimate the expected return of a portfolio based on the historical returns of its constituent assets.
Classification: used for predicting discrete categories.
Credit scoring: predict whether a loan applicant is likely to default based on their credit history and other financial attributes.
Fraud detection: identify whether a particular transaction is fraudulent based on transaction patterns.
Customer segmentation: classify customers into segments based on their purchasing behavior, credit scores, and demographic factors for targeted marketing strategies.

How does supervised learning work?
Training phase: The model is fed a set of input-output pairs. The algorithm uses these pairs to learn the relationship between the inputs and the outputs.
Validation phase: The model's predictions are compared against known outputs to evaluate its performance. Techniques like cross-validation can be used to tune the model parameters.
Testing phase: The model is tested on a separate, unseen dataset to assess its generalization ability.

[Figure: how supervised learning works]

Hyperparameter tuning
Hyperparameters are parameters that are not learned from the data but are set prior to the training process, for example the strength of a regularization penalty.
Hyperparameter tuning is a crucial step in the machine learning process, especially for supervised learning models. It involves selecting the set of hyperparameters for a learning algorithm that optimizes its performance.
Proper tuning can significantly improve a model's accuracy and generalization capability.

In-sample vs. Out-of-sample
In-sample data is the data that the model is trained on: the dataset used during the training phase, where the model learns the underlying patterns and relationships.
The main goal of using in-sample data is to fit the model, adjusting its parameters to minimize the error on this dataset.
Out-of-sample data is data that the model has not seen during training. This includes the validation and test datasets used to evaluate the model’s performance and generalization capability.
The main goal of using out-of-sample data is to provide an unbiased estimate of how the model will perform on new, unseen data.

Group project
The dataset provided covers the daily price and volume of shares of 31 NASDAQ companies for the year 2022.
The predictors are technical indicators used by technical traders (e.g., RSI, moving averages, stochastic oscillators).
The target variable is the return from holding the stock for the following 5 open market days (INCREMENTO).
Objective: Develop AI models to form a trading strategy that is likely to provide the highest returns.
Use data from January to November to train the model.
Use data from December 1st to December 21st to test your model's performance.
Assume it is December 22nd, 2022. Apply your trading strategy and evaluate the returns.

Group project: RSI and MACD
Relative Strength Index (RSI): RSIadjclose15, RSIvolume15, RSIadjclose25, etc. RSI measures the magnitude of recent price changes to evaluate overbought or oversold conditions. It is typically calculated for different time periods, such as 15, 25, or 50 days.
Moving Average Convergence Divergence (MACD): MACDadjclose15, MACDvolume15, MACDadjclose25, etc. MACD is a trend-following momentum indicator that shows the relationship between two moving averages of a security’s price or volume.
MACDsig-adjclose-15, MACDdif-adjclose-15-0, etc.: These are components of the MACD indicator, including the signal line and the difference between the MACD line and the signal line (histogram).

Group project: EMA and SMA
Exponential Moving Average (EMA): emaadjclose5, emavolume5, emaadjclose10, etc. EMA is a type of moving average that gives more weight to recent prices, making it more responsive to new information. It is calculated for different time periods, such as 5, 10, or 50 days.
Simple Moving Average (SMA): smaadjclose5, smavolume5, smaadjclose10, etc. SMA is the arithmetic mean of a given set of prices over a specific number of past days, such as 5, 10, or 50 days.
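The SMA and EMA definitions above can be made concrete with a short sketch. It assumes the common smoothing factor 2 / (window + 1) for the EMA and seeds the EMA with the first price; the formulas actually used to build the group-project columns (for example `smaadjclose5` or `emaadjclose5`) are not given in the slides and may differ. The price list is made up for illustration.

```python
def sma(prices, window):
    """Simple moving average: arithmetic mean of the last `window` prices.

    Returns None for positions where fewer than `window` prices are available.
    """
    out = []
    for i in range(len(prices)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(prices[i + 1 - window:i + 1]) / window)
    return out


def ema(prices, window):
    """Exponential moving average, which weights recent prices more heavily.

    Assumes smoothing factor alpha = 2 / (window + 1) and seeds with the first price.
    """
    alpha = 2 / (window + 1)
    out = [prices[0]]
    for price in prices[1:]:
        out.append(alpha * price + (1 - alpha) * out[-1])
    return out


# Made-up prices: flat at 10, then a jump to 20 on the last day.
prices = [10, 10, 10, 20]
print(sma(prices, 3)[-1])  # mean of the last three prices
print(ema(prices, 3)[-1])  # reacts more strongly to the latest price
```

With these prices the 3-day EMA ends above the 3-day SMA, which illustrates why the EMA is described as more responsive to new information.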
Group project: Bollinger Bands and ATR
Bollinger Bands: bollingerMA5-5adjclose, bollingerBU5-5adjclose, bollingerBL5-5adjclose, etc. Bollinger Bands are volatility bands placed above and below a moving average of the price; they widen and contract based on market volatility. The MA, BU, and BL prefixes appear to denote the moving average, the upper band, and the lower band.
Average True Range (ATR): atr5, atr10, atr15, atr20. ATR measures market volatility by decomposing the entire range of an asset price for a given period. It is calculated for different time periods, such as 5, 10, or 20 days.

Group project: CCI and Stochastic Oscillator
Commodity Channel Index (CCI): cci5, cci10, cci15, cci25, cci40, cci50. CCI measures the difference between the current price and the historical average price. It is used to identify cyclical trends in security prices.
Stochastic Oscillator: K-5, D-5, stochastic-k-5, stochastic-d-5, etc. The Stochastic Oscillator compares a particular closing price of a security to a range of its prices over a certain period. It consists of two lines: %K and %D.

Group project: volume and pattern indicators
Volume Indicators: volumenrelativo measures the relative volume of trades compared to the historical average.
Pattern Recognition: hammer1y1highhighhigh, hammer1y1highhighlow, etc. These appear to be custom pattern recognition metrics, possibly indicating the presence of certain candlestick patterns, such as hammers, over various timeframes.

Linear regression
Linear regression is a fundamental statistical and machine learning technique used to model the relationship between a dependent variable and one or more independent variables.
The goal is to find the best-fitting line (or hyperplane in higher dimensions) that describes the relationship between the variables.

Least Squares Method
The least squares method chooses the coefficients that minimize the sum of squared differences between the observed values $y_i$ and the predicted values $\hat{y}_i$: $\min_{\beta} \sum_i (y_i - \hat{y}_i)^2$.

Task 3
Load the Housing data.
Click “Datasets” to add a Datasets widget to your canvas.
Double-click the newly added widget (“Datasets”) and select “Housing”.
Get an initial overview of the data.
Link the output of “Datasets” to a “Data Table”.
Link the output of “Datasets” to a “Feature Statistics”.

Housing Data
The Boston house-price data of Harrison, D. and Rubinfeld, D.L., 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol. 5, 81-102, 1978.
The regression goal is to predict the median value of owner-occupied homes.
Output variable (desired target): MEDV, the median value of owner-occupied homes in $1000s
Input variables (13):
CRIM: per capita crime rate by town
ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS: proportion of non-retail business acres per town
CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX: nitric oxides concentration (parts per 10 million)
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied units built prior to 1940
DIS: weighted distances to five Boston employment centres
RAD: index of accessibility to radial highways
TAX: full-value property-tax rate per $10,000
PTRATIO: pupil-teacher ratio by town
B: $1000(B_k - 0.63)^2$, where $B_k$ is the proportion of Black residents by town
LSTAT: % lower status of the population

Task 3 (continued)
Run a linear regression.
Link the output of “Datasets” to a “Linear Regression”.
Review the regression findings.
Link the outputs of “Datasets” and “Linear Regression” to a “Predictions”.
Link the output of “Linear Regression” to a “Data Table”.
Check the prediction results.
Double-click “Predictions”.

[Figure: Task 3]

Logistic regression
Logistic regression is a statistical method used for binary classification: it predicts a dependent variable that can take one of two possible outcomes.

Logistic Function (Sigmoid Function)
The logistic regression model uses the logistic function to model the probability of the dependent variable.
The logistic function is defined as $\sigma(z) = \frac{1}{1 + e^{-z}}$, where $z = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$ is a linear combination of the independent variables.

Logit Function
The output of the logistic function is a probability $p$ that lies between 0 and 1.
The odds of the dependent variable being 1 (as opposed to 0) are given by the ratio of the probability of 1 to the probability of 0, $p / (1 - p)$.
The logit function is the natural logarithm of the odds: $\operatorname{logit}(p) = \ln\frac{p}{1 - p}$.
Logistic regression models the logit of the probability as a linear combination of the independent variables: $\operatorname{logit}(p) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$.

Task 4
Load the Bank Marketing data.
Click “Datasets” to add a Datasets widget to your canvas.
Double-click the newly added widget (“Datasets”) and select “Bank Marketing”.
Get an initial overview of the data.
Link the output of “Datasets” to a “Data Table”.
Link the output of “Datasets” to a “Feature Statistics”.

Bank Marketing Data
The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution.
The classification goal is to predict whether the client will subscribe to a term deposit (variable y).
Output variable (desired target): y - has the client subscribed to a term deposit? (binary: "yes","no")
Input variables (16): bank client data, data related to the last contact of the current campaign, and other attributes

Bank Marketing
# bank client data:
1 - age (numeric)
2 - job: type of job (categorical: "admin.", "unknown", "unemployed", "management", "housemaid", "entrepreneur", "student", "blue-collar", "self-employed", "retired", "technician", "services")
3 - marital: marital status (categorical: "married", "divorced", "single"; note: "divorced" means divorced or widowed)
4 - education (categorical: "unknown", "secondary", "primary", "tertiary")
5 - default: has credit in default? (binary: "yes","no")
6 - balance: average yearly balance, in euros (numeric)
7 - housing: has housing loan? (binary: "yes","no")
8 - loan: has personal loan? (binary: "yes","no")

Bank Marketing (continued)
# related to the last contact of the current campaign:
9 - contact: contact communication type (categorical: "unknown", "telephone", "cellular")
10 - day: last contact day of the month (numeric)
11 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
12 - duration: last contact duration, in seconds (numeric)
# other attributes:
13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
14 - pdays: number of days that passed after the client was last contacted in a previous campaign (numeric, -1 means the client was not previously contacted)
15 - previous: number of contacts performed before this campaign and for this client (numeric)

Task 4 (continued)
Select the input and output variables.
Link the output of “Datasets” to a “Select Columns”.
Run a logistic regression.
Link the output of “Select Columns” to a “Logistic Regression”.
Review the regression findings.
Link the outputs of “Select Columns” and “Logistic Regression” to a “Predictions”.
Link the output of “Logistic Regression” to a “Data Table”.
Check the prediction results.
Double-click “Predictions”.

[Figure: Task 4]

Overfitting
Overfitting occurs when a machine learning model captures the noise and details of the training data to such an extent that it hurts the model's performance on new, unseen data.
The model performs very well on the training data but poorly on the test data or any new data.
Overfitting is a common problem in machine learning, particularly when the model is too complex relative to the amount and variability of the data.
Common causes:
Complex Models: using models with too many parameters (e.g., deep neural networks, high-degree polynomial regressions) relative to the size of the dataset.
Insufficient Training Data: small datasets are more susceptible to noise, which the model may learn.
Noise in the Data: irrelevant features or random errors in the training data can lead the model to learn patterns that do not generalize.

[Figure: overfitting]

Overfitting (continued)
Poor Generalization: The primary consequence of overfitting is that the model performs well on the training data but poorly on new, unseen data. This lack of generalization makes the model unreliable for practical use.

Regularization
Regularization is a technique used to prevent overfitting by adding a penalty term to the loss function.
This penalty discourages the model from fitting too closely to the training data by constraining the magnitude of the coefficients.
Ridge Regression (L2 Regularization): adds a penalty equal to the sum of the squares of the coefficients (excluding the intercept) to the loss function.
Lasso Regression (L1 Regularization): adds a penalty equal to the sum of the absolute values of the coefficients to the loss function.
Elastic Net: combines L1 and L2 regularization.

Regularization (continued)
Ridge Regression (L2 Regularization): $\text{Loss} + \lambda \sum_{j=1}^{p} \beta_j^2$
Lasso Regression (L1 Regularization): $\text{Loss} + \lambda \sum_{j=1}^{p} |\beta_j|$
Elastic Net: $\text{Loss} + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2$

[Figure: regularization]

Task 5
Add L1 regularization to a logistic regression.
Link the output of “Select Columns” to a “Logistic Regression”.
Double-click “Logistic Regression” and, under “Regularization type”, select “L1”.
Link the outputs of “Select Columns” and “Logistic Regression” to a “Predictions”.
Link the output of “Logistic Regression” to a “Data Table”.
Compare with the previous results.

Thank you
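As a supplement to the Regularization slides and Task 5, the sketch below contrasts how the L2 and L1 penalties act on a coefficient. It is an illustration under simplifying assumptions, not what Orange's Logistic Regression widget computes: it uses squared-error loss (linear rather than logistic regression), a single feature, and no intercept, because that case has closed-form solutions. The data and the penalty strengths `lam` are made up.

```python
def ols_slope(x, y):
    """Unpenalized least squares slope for y ~ b * x (no intercept)."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / sxx


def ridge_slope(x, y, lam):
    """Minimizes sum((y - b*x)^2) + lam * b^2; shrinks b but never to exactly zero."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)


def lasso_slope(x, y, lam):
    """Minimizes sum((y - b*x)^2) + lam * |b| via soft-thresholding.

    Once lam / 2 reaches |sum(x*y)|, the coefficient is set exactly to zero.
    """
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    shrunk = max(abs(sxy) - lam / 2, 0.0)
    sign = 1 if sxy > 0 else -1
    return sign * shrunk / sxx


# Made-up data with true slope 2.
x = [1, 2, 3]
y = [2, 4, 6]
for lam in (0, 14, 56):
    print(lam, ridge_slope(x, y, lam), lasso_slope(x, y, lam))
```

The L1 penalty can drive a coefficient all the way to zero while the L2 penalty only shrinks it, which is one thing to look for when comparing the L1 results in Task 5 with the earlier, differently regularized model.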
