Summary

This document provides a summary of key ideas for understanding asset pricing theory and machine learning in finance. It includes sections on using machine learning to build investment portfolios, measuring portfolio performance, and specific quantitative techniques.

Full Transcript


Abstract

The notes below contain a summary of the key ideas you need to understand, and to a lesser extent remember, before you go to the exam. This will allow you to get close to ready for most of the questions concerning parts 2 and 3 of the class. To move from this to completely ready, you would need to make sure you understand the main point of every figure and table taken from a paper and discussed in class. You should also make sure you understand the assignment and tutorial well enough to comment on a piece of code. One of the questions will show you a correct piece of code taken from the assignment and you will be asked to explain what the code does. In this question, as in all others, we don't expect a high level of detail, just proof that you understand the main idea. If a question looks too easy, it likely is because it is! Many questions are designed to test basic understanding. Don't overthink!

I. Asset Pricing Theory vs Machine Learning

Asset pricing theory

1. Assumes no arbitrage (no free money) and no statistical arbitrage (no portfolio should have a very high expected return without taking very high risks). No statistical arbitrage implies that the returns of stocks can be explained by a factor structure with only a few (5-10 max) factors explaining returns.

Machine learning

1. On paper it has very high performance.
2. Because machine learning uses a lot of parameters, its results are hard to reconcile with a world where a few factors can explain returns.
3. This is a challenge to asset pricing theory. Although it's also hard to explain why, if there is so much money to be made, nobody has made it. Hence it's also possible that the performance of ML on paper is spurious and researchers missed something in their tests.

How ML is used in finance

1. Either to process data that can't be processed otherwise (text, images, etc.),
2. Or to process traditional data (firm characteristics) better than the old models.

II. Using ML to Make a Portfolio

Part A: Build the signals (see the first sketch below)

1. Split the sample into training, validation, and test sets.
2. Train a set of models (all with different hyperparameters) on the training sample.
3. Use the validation sample to measure performance and see which hyperparameters are optimal.
4. Keep the forecasts of the best model on the test sample.
5. Repeat steps 1-4 and concatenate the results to get a set of out-of-sample forecasts. Call these the signals.

Part B: Build the portfolio (see the second sketch below)

1. Use the out-of-sample signals to build portfolio weights. Those can be:

- Long-short decile: buy the top 10% of signals, sell (short) the bottom 10%. That strategy can be:
  – Equally Weighted (EW): we just buy (sell) each stock in the buy (sell) decile once. This strategy tends to perform well but can't be implemented at scale.
  – Value Weighted (VW): we buy (sell) the stocks in the buy (sell) decile proportionally to their market cap (size). This strategy has lower transaction costs and is more implementable, but usually performs worse. This is because stocks with a high market cap are traded by big funds who correct most of the mispricing.
- Use the signals directly as weights. In that case, we let the model choose the amount of dollars invested and the leverage of the strategy. On paper this often works well, but it's risky in real life as you give a lot of decision power to your model.

2. Then use your weights to build a portfolio with a single return at every timestep.
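As a reference for Part A, here is a minimal sketch of the train/validation/test loop. The arrays X (firm characteristics) and y (returns), the Ridge model family, the alpha grid, and the windows list are all hypothetical stand-ins for whatever the assignment actually uses.

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.metrics import mean_squared_error

    def best_oos_forecast(X_train, y_train, X_val, y_val, X_test,
                          alphas=(0.1, 1.0, 10.0)):
        """Train one model per hyperparameter, pick the winner on the
        validation sample, and keep its forecasts on the test sample."""
        best_alpha, best_mse = None, np.inf
        for a in alphas:
            model = Ridge(alpha=a).fit(X_train, y_train)
            mse = mean_squared_error(y_val, model.predict(X_val))
            if mse < best_mse:
                best_alpha, best_mse = a, mse
        return Ridge(alpha=best_alpha).fit(X_train, y_train).predict(X_test)

    # Step 5: repeat over rolling windows (a hypothetical list of
    # (X_train, y_train, X_val, y_val, X_test) tuples) and concatenate
    # the test forecasts into one out-of-sample signal series:
    # signals = np.concatenate([best_oos_forecast(*w) for w in windows])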
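And for Part B, a sketch of the long-short decile construction, assuming a hypothetical pandas DataFrame with columns date, signal, ret, and mktcap (illustrative names, not the assignment's):

    import pandas as pd

    def long_short_decile(df, weighting="ew"):
        """Buy the top signal decile, short the bottom one, EW or VW."""
        df = df.copy()
        # Assign each stock to a signal decile (0 = lowest, 9 = highest) per date
        df["decile"] = df.groupby("date")["signal"].transform(
            lambda s: pd.qcut(s, 10, labels=False))
        longs, shorts = df[df["decile"] == 9], df[df["decile"] == 0]

        def leg_return(leg):
            if weighting == "ew":
                # EW: each stock in the leg gets the same weight
                return leg.groupby("date")["ret"].mean()
            # VW: weight each stock by its market cap within the leg
            w = leg["mktcap"] / leg.groupby("date")["mktcap"].transform("sum")
            return (leg["ret"] * w).groupby(leg["date"]).sum()

        return leg_return(longs) - leg_return(shorts)

    # ls_ret = long_short_decile(panel, weighting="vw")
    # With monthly data, the annualized Sharpe ratio of the strategy is
    # (12 ** 0.5) * ls_ret.mean() / ls_ret.std()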
III. Measuring Performance of an ML Portfolio

Remember, at the most basic level we want to see high reward (mean return) for low risk (variance or std of return). Hence the basic performance metric is the Sharpe ratio (i.e., annualized mean/std of return). But we also want to check that our performance could not be obtained in a simpler or already known way.

Basic table of performance

We usually show a table, split by decile or at least long vs short, of:

- Mean return,
- Std of return,
- Sharpe ratio (i.e., annualized mean/std of return).

We want to see that as the signal decile increases, the mean return tends to increase with the std fairly stable, i.e., the Sharpe ratio increases with the decile. More importantly, we want to see that the long-short portfolio has a high Sharpe ratio.

Checking alpha

You should always check that your portfolio has performance over a set of benchmarks. Those usually include the market portfolio mkt_t, but often also other famous factors like Fama-French HmL_t or SmB_t (google them if you don't know them). The regressions look like:

r_t = α + β_mkt mkt_t + β_HmL HmL_t + β_SmB SmB_t + ε_t    (1)

What we want to see is that α is statistically significant and positive. It means we have performance that cannot be trivially explained by a known strategy.

When building a very complex model it's also a good idea to include a simple version of the strategy and check that we have power over it. Let's call this simple strategy simple_t. We run:

r_t = α + β_mkt mkt_t + β_HmL HmL_t + β_SmB SmB_t + β_simple simple_t + ε_t    (2)

and check that α is still statistically significant and positive. Otherwise it's likely better to just implement simple_t. (A code sketch of this check follows below.)
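A minimal sketch of the alpha check with statsmodels, assuming hypothetical variables rt (a series of portfolio returns) and factors (a DataFrame of benchmark returns: mkt, HmL, SmB, plus the simple strategy's returns for equation (2)):

    import statsmodels.api as sm

    def alpha_check(port_ret, factors):
        """Regress portfolio returns on benchmarks: eq. (1) if factors holds
        [mkt, HmL, SmB], eq. (2) if it also includes the simple strategy."""
        X = sm.add_constant(factors)       # the intercept is the alpha
        res = sm.OLS(port_ret, X).fit()
        return res.params["const"], res.tvalues["const"]

    # alpha, t_alpha = alpha_check(rt, factors)
    # We want alpha > 0 and statistically significant (e.g., |t| > 2).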
IV. What Is Special About Finance Data for ML

- Unbalanced,
- Not very stationary,
- The size of the sample depends on modeling choices (number of firms, time period).

Frictions

In real life it is very costly to:

- Rebalance the portfolio (bid-ask spread and price impact),
- Short stocks (it is very costly to maintain the short leg of a portfolio).

The performance of ML on paper often underestimates or simply ignores this.

V. Markowitz and OLS vs Penalized Markowitz and Ridge

A. OLS vs Ridge

1. A linear regression (OLS) finds the optimal weights β that minimize the mean squared error: min_β ∥R − Xβ∥₂².
2. The ridge regression expands on this model by adding a penalty term λ: min_β {∥R − Xβ∥₂² + λ∥β∥₂²}.
3. The penalty term means that the model now has to pay a price every time it wants to buy a beta. When λ = 0 we pay no price and ridge becomes OLS. When λ is very high, we pay a high price and will choose to buy no beta.
4. β is a parameter that we estimate on the training sample; λ is a hyperparameter that we select using performance on the validation sample (see the earlier sections).
5. Penalties (ridge) are more necessary when we have lots of predictors and/or little data. With infinite data, we never need penalization and OLS is always optimal.

B. Unpenalized Markowitz

Both Markowitz versions below compute the optimal weights without any penalization. These models are often optimal if we have a lot of observations and few stocks. We have N stocks, a covariance matrix Σ for those stocks, and a vector of expected returns µ (see slides for dimensions), and we want to find a vector of weights w.

- Unconstrained Markowitz: the optimal weights with no constraints are w = Σ⁻¹µ, i.e., the inverse of the covariance matrix (Σ⁻¹) multiplying the vector of expected returns (µ).
- Markowitz, fixed leverage: we can also normalize the Markowitz weights so that they sum to 1: w = Σ⁻¹µ / (1′Σ⁻¹µ).

C. Penalized Markowitz

Penalized Markowitz does to Markowitz what Ridge does to OLS. It's the EXACT same logic and intuition, except the weights w play the role of the β: w = (Σ + λI)⁻¹µ. Don't focus too much on the equation specifics; try to understand the logic. A few things to keep in mind:

- Just like Ridge, the penalty diminishes the willingness of the model to select high absolute weights (betas for the ridge). This means that with a high penalty we will have lower weights, and with a very large penalty weights equal to zero.
- Just like with Ridge/OLS, penalized Markowitz is comparatively better when we have a small sample and/or lots of stocks.

(Both closed forms appear in the sketch below.)
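A minimal numpy sketch of both closed forms. Note that the penalized formula w = (Σ + λI)⁻¹µ is the standard ridge-style reconstruction; check the slides for the exact notation used in class.

    import numpy as np

    def ridge_beta(X, R, lam):
        """Closed-form ridge: beta = (X'X + lam*I)^(-1) X'R; lam = 0 is OLS."""
        k = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ R)

    def markowitz_weights(Sigma, mu, lam=0.0, fix_leverage=False):
        """Penalized Markowitz: the same ridge logic applied to weights.
        lam = 0 gives the unconstrained solution w = Sigma^(-1) mu."""
        N = len(mu)
        w = np.linalg.solve(Sigma + lam * np.eye(N), mu)
        if fix_leverage:
            w = w / w.sum()   # w = Sigma^(-1) mu / (1' Sigma^(-1) mu)
        return w

As λ grows, the (unnormalized) weights shrink toward zero, exactly like the ridge betas.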
VI. Textual Analysis: The Old Ways

Before LLMs arrived, when processing text in finance we used:

- Bag-of-words models, where we simply count words associated with meanings. For example, we count positive and negative words to build sentiment, or we count green words to build a measure of green commitment.
- TF-IDF, a method that transforms text into vectors taking into account the frequency of a term in a document and how frequent it is across the whole corpus.
- Simple machine learning models like word2vec or BERT. Those models are already quite complex by any standard, but seem to be overpowered in most financial tasks by LLMs now.

VII. LLMs Demystification

A. Model's Structure

The main parts:

1. Tokenize the input text.
2. Add positional encoding to each token.
3. Pass the input through attention heads.
4. Collapse and concatenate the outputs of the attention heads.
5. Feed that output to a classic neural network that predicts the next token.

Tokenizing the text. Tokens are bits of text; e.g., two tokens like "he" and "llo" could make a word. Each token is written as a set of numbers. There exist more or less optimal ways of creating the token space so that it encodes as much meaning as possible and is optimal for the next step.

Positional encoding. We need to tell the LLM which words come in which order. The obvious solution (putting numbers 1, 2, ..., 1000, etc.) does not work, because a neural network needs bounded inputs (between 0 and 1, for example). The solution is a tricky trigonometric function that encodes the positions as a set of numbers (see the first sketch at the end of this section).

Attention heads.

1. The inside of an attention head is a huge matrix where each entry models the importance (attention) of one token for another, a bit like the off-diagonal entries of a covariance matrix model the covariance between two stocks.
2. The attention head uses the input matrix (tokenized text plus positional encoding) multiple times, in Q, K, and V:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

3. Modern LLMs have not one attention head but many. This is huge and involves a lot of parameters. It also means the relationship (attention) between each pair of tokens is modeled multiple times.
4. We then aggregate the outputs of those attention heads and make them one big super-complex tensor (a matrix, but with more than two dimensions). This tensor represents how the model "views" the text (important for encoding later on!). This tensor (often called the latent representation of the prompt) is the input to the next step.

Predicting the next token. We feed this complex latent representation (the output of the attention heads) to a classical neural network. This classical network's job is to take the latent representation and use it to predict the probability of the next token.

B. Training the Model

All the parameters of all the parts are trained jointly on a huge training sample to predict the next word. The training sample is just a large set of text, and the model is trained by hiding the next token and predicting it from the tokens that come before. Thanks to Llama 3.1's open-source policy, we know:

- What the token composition of the training set looks like (50% general knowledge, 8% multilingual, the rest math and code),
- What the complex procedure for cleaning and processing the dataset looks like.

C. Inference

When we use the model, we get a prompt and want to generate a text the user likes. To do so, we generate autoregressively:

- Generate one token at a time,
- Update the context every time.

To generate each token, we use the probability distribution from our trained LLM and just draw from it. We usually limit the distribution to the top-k choices or the top-p nucleus of choices. This is to avoid very rare events where the model just screams out a random token that makes no sense. We always put in some randomness instead of choosing the top choice (randomness is often modeled with a parameter called temperature), because in practice always selecting the top choice leads to weird responses that feel unnatural to users (see the second sketch at the end of this section).

The model doesn't naturally know when to stop talking. We can make it stop producing tokens in several ways:

- Max length: just define a maximum length, which feels a bit abrupt to the user.
- End-of-sequence token (EOS): in training we model the end of a sequence in our sample as a token. It's as if the model learns a token that means "stop talking now." When it produces this token at inference, it stops talking.
- A probability threshold based on the distribution of potential next tokens: e.g., if there is no clear set of tokens to say next, maybe it's time to stop talking.
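A toy numpy illustration of two building blocks from part A: the sinusoidal positional encoding and the attention formula. In a real head, Q, K, and V are learned linear projections of the input; here we pass the same matrix for all three just to show the computation.

    import numpy as np

    def positional_encoding(seq_len, d_model):
        """Bounded sin/cos numbers that encode each token's position
        (d_model is assumed even here)."""
        pos = np.arange(seq_len)[:, None]
        i = np.arange(d_model // 2)[None, :]
        angles = pos / (10000 ** (2 * i / d_model))
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)   # even dimensions
        pe[:, 1::2] = np.cos(angles)   # odd dimensions
        return pe

    def softmax(z):
        e = np.exp(z - z.max(axis=-1, keepdims=True))  # numerical stability
        return e / e.sum(axis=-1, keepdims=True)

    def attention(Q, K, V):
        """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
        scores = Q @ K.T / np.sqrt(K.shape[-1])  # token-by-token importance
        return softmax(scores) @ V

    # 5 tokens, embedding dimension 8: embeddings plus positional encoding
    X = np.random.randn(5, 8) + positional_encoding(5, 8)
    out = attention(X, X, X)   # one (simplified) attention head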
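And a toy sketch of inference-time sampling with temperature and top-k truncation; logits stands for the model's scores over the vocabulary, and model, EOS_ID, and MAX_LEN in the comments are hypothetical placeholders.

    import numpy as np

    def sample_next_token(logits, temperature=0.8, top_k=50):
        """Draw the next token from the trained distribution, restricted to
        the top-k choices, with temperature controlling the randomness."""
        logits = np.asarray(logits, dtype=float) / temperature
        top = np.argsort(logits)[-top_k:]            # keep the k best tokens
        probs = np.exp(logits[top] - logits[top].max())
        probs /= probs.sum()
        return np.random.choice(top, p=probs)        # draw, don't argmax

    # Autoregressive loop: generate, append to the context, repeat.
    # while token != EOS_ID and len(context) < MAX_LEN:
    #     token = sample_next_token(model(context))
    #     context.append(token)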
VIII. Measuring Performance

Performance in LLMs is not obvious: predicting the next token (the training goal) is not the same as the inference objective (sounding like a good assistant). People have developed a lot of heuristics that fall into four categories, along two dimensions:

1. Static/Live:
   (a) Static: always the same questions (e.g., math exams). Static benchmarks have the big problem that LLM developers will cheat and make sure their model can pass the exam next time.
   (b) Live: questions change dynamically (e.g., coding competitions or Chatbot Arena).
2. Ground truth/Human preference:
   (a) Ground truth: there is an objective answer (e.g., math, code, etc.).
   (b) Human preference: there is no objective performance, and we rely on voting or a similar mechanism to estimate human preference (e.g., Chatbot Arena).

IX. Pricing

We've seen in class that running a free model (Meta's Llama 3.1) on rented hardware is more expensive than paying for a comparable private model. This suggests companies are losing money providing those services.

X. Fine-Tuning

We can take an existing model trained on a general dataset and make it excel at specific tasks:

- It takes a surprisingly small amount of data,
- It can be done easily with official APIs (for a price).

XI. LLMs and Portfolios

A. Encoding text

In finance we often want to transform text into vectors that:

- Capture as much meaning as possible,
- Lose as little information as possible,
- Are as small as possible,
- Are as interpretable as possible.

If you stop the LLM after the attention heads, you get a latent representation that can be used as input for another model. You know this latent representation is good because it can produce such an impressive output when given to the second half of the LLM. We've seen in class that you can use this input to create an encoding that:

– Has super high meaning,
– Has very decent completeness (loses not too much information),
– Is of manageable size,
– Has no human interpretation,
– AND can be used as input to build a portfolio that produces high Sharpe ratios (the news example seen in class).

Hence, these encodings carry a lot of economic meaning when used correctly.

XII. Results from the Paper Seen in Class

- With a large number of news articles and alerts (shorter, faster news items, essentially headlines), we build portfolios and obtain super high Sharpe ratios.
- As often, we get higher Sharpe ratios with EW than with VW.
- If we encode with a simpler model, we get worse performance than with a complex model.
- When using a complex model, we need a lot of data, and the model fails to perform when applied to countries with only a few firms and little news.
- Most of the performance is obtained on the first day after the news. This is especially true for articles (as opposed to alerts) and for large firms.

XIII. Analysts

Chain-of-thought is a technique to extract reasoning (or something that looks like it) from LLMs. It consists of adding leading questions to encourage reasoning before asking the question. When comparing LLMs to analysts' forecasts of earnings per share (EPS), researchers found that:

- Analysts are on par with simple non-chain-of-thought LLMs.
- Chain-of-thought LLMs crush analysts.
- When using the same numbers as input to a classical artificial neural network (ANN)*, the ANN performs on par with the LLM and, quite surprisingly, makes very similar predictions.
- The LLMs did not exhibit the same prediction biases as the analysts (e.g., the book-to-market bias), but did tend to perform badly where predictions seem to simply be harder (e.g., small firms or high volatility of earnings).

*ANN (artificial neural network), FNN (feed-forward network), and "classical network" are, in this class, the same thing. We just adapt to the notation of different papers. All of them mean: a classical neural network that takes some input and makes a direct prediction.

XIV. Other Stuff to Prepare

Please note again that all of this will only get you almost ready. To be fully ready for the exam, you also need to:

- Understand every figure and table that was discussed in class (if it's in the slides but was not discussed in class, you don't need to be able to discuss it). Any additional discussion we spent a lot of time on in class but that is not included in this summary can be on the exam, although this summary covers the essentials to already get a very good grade.
- Make sure you UNDERSTAND everything above. Learning it by heart without understanding what it means will be of little help during the exam.
- Make sure you understand the code in every solution of our tutorials and exams. You will NOT have to write code in the exam. However, you should be able to see code you've seen before and explain what it does.
