Full Transcript

# Lecture 15: November 16, 2023

## Reminders

* HW 7 due tonight
* **Final Project Proposal** due tomorrow night

## Final Project

* 3 weeks to work on it
* Grading
  * Proposal (5%)
  * Presentation (10%)
  * Code (50%)
  * Write-up (35%)

## Write-up

* No more than 6 pages in NeurIPS format.
* Include:
  * Introduction
  * Related work
  * Method
  * Experiments
  * Conclusion

## Today

* Continue learning theory

## Last time

* Rademacher complexity
* Model complexity
* Generalization bounds

## Recall: Generalization bounds

$$
R(h) \leq \hat{R}(h) + \sqrt{\frac{\log(1/\delta)}{2m}} + \mathfrak{R}_m(\mathcal{H})
$$

where $\mathfrak{R}_m(\mathcal{H})$ is the Rademacher complexity of the hypothesis class $\mathcal{H}$.

## Rademacher Complexity

Informally, the Rademacher complexity measures the ability of the hypothesis space to fit random noise.

**Definition:** Let $\mathcal{H}$ be a class of functions mapping from $\mathcal{Z}$ to $\{-1,+1\}$. The empirical Rademacher complexity of $\mathcal{H}$ with respect to the sample $S = \{z_1, \dots, z_m\}$ is

$$
\hat{\mathfrak{R}}_S(\mathcal{H}) = \mathbb{E}_{\sigma} \left[ \sup_{h \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^m \sigma_i h(z_i) \right],
$$

where $\sigma = (\sigma_1, \dots, \sigma_m)$ is a vector of i.i.d. uniform random variables taking values in $\{-1, +1\}$.

The Rademacher complexity is the expectation of the empirical Rademacher complexity over all samples of size $m$ drawn from the distribution $\mathcal{D}$:

$$
\mathfrak{R}_m(\mathcal{H}) = \mathbb{E}_{S \sim \mathcal{D}^m}[\hat{\mathfrak{R}}_S(\mathcal{H})]
$$

* How well can our classifier do if the labels are random?

## Example: Neural Net

* $\mathcal{H}$ = neural nets with $k$ layers, each layer with at most $d$ nodes, and weights bounded by $B$.
* Then,
$$
\mathfrak{R}_m(\mathcal{H}) = \mathcal{O}\left(\frac{B \cdot k \cdot d \sqrt{\log(d)}}{m}\right)
$$
* This bound depends linearly on the depth.
* But in practice, deeper networks generalize better.
* **Takeaway: Rademacher complexity can be a loose bound.**

## PAC-Bayes

* Another approach to generalization bounds
* **Key idea:** Instead of bounding the error of a single hypothesis $h$, bound the average error over a distribution of hypotheses.

## Setup

* Let $\mathcal{H}$ be a set of hypotheses
* Let $P$ be a prior distribution over $\mathcal{H}$
* Let $Q$ be a posterior distribution over $\mathcal{H}$

## Gibbs Algorithm

1. Draw a hypothesis $h$ from $Q$
2. Make a prediction using $h$

## PAC-Bayes Bound

$$
\mathbb{E}_{h \sim Q}[R(h)] \leq \mathbb{E}_{h \sim Q}[\hat{R}(h)] + \sqrt{\frac{KL(Q||P) + \log(\frac{m}{\delta})}{2(m-1)}}
$$

with probability at least $1 - \delta$ (over the choice of the training set).

* $KL(Q||P)$ is the Kullback-Leibler divergence between $Q$ and $P$:
$$
KL(Q||P) = \sum_{h \in \mathcal{H}} Q(h) \log\left(\frac{Q(h)}{P(h)}\right)
$$

## Intuition

* $KL(Q||P)$ measures how different the posterior is from the prior.
* If the posterior is very different from the prior, then we need more data to generalize well.

## Example: Binary Classification

* $\mathcal{H}$ = set of all possible classifiers
* $P$ = uniform distribution over $\mathcal{H}$

## How to choose $Q$?

* We want $Q$ to be close to $P$ (to keep $KL(Q||P)$ small).
* We also want $Q$ to put more weight on hypotheses that do well on the training data:

$$
Q(h) \propto P(h) \exp(-\lambda \hat{R}(h)),
$$

where $\lambda$ is a hyperparameter that controls the trade-off between fitting the data and staying close to the prior. A sketch of this posterior and the resulting bound is given below.
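To make the trade-off concrete, here is a minimal sketch (my own illustration, not from the lecture) that builds the Gibbs posterior over a small finite hypothesis class and plugs it into the PAC-Bayes bound above. The hypothesis class, the empirical risks, $m$, and $\delta$ are made-up toy values, and the helper names (`gibbs_posterior`, `pac_bayes_bound`) are hypothetical, not from the course.

```python
import numpy as np

def gibbs_posterior(prior, emp_risks, lam):
    """Gibbs posterior Q(h) ∝ P(h) * exp(-lambda * R_hat(h)) over a finite class."""
    log_unnorm = np.log(prior) - lam * emp_risks
    log_unnorm -= log_unnorm.max()          # subtract max for numerical stability
    q = np.exp(log_unnorm)
    return q / q.sum()

def kl_divergence(q, p):
    """KL(Q || P) = sum_h Q(h) log(Q(h) / P(h)), skipping zero-mass terms."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

def pac_bayes_bound(q, p, emp_risks, m, delta=0.05):
    """McAllester-style bound: E_Q[R_hat] + sqrt((KL(Q||P) + log(m/delta)) / (2(m-1)))."""
    expected_emp_risk = float(np.dot(q, emp_risks))
    slack = np.sqrt((kl_divergence(q, p) + np.log(m / delta)) / (2 * (m - 1)))
    return expected_emp_risk + slack

# Toy example: 4 hypotheses, uniform prior, training-set size m = 200.
prior = np.full(4, 0.25)
emp_risks = np.array([0.30, 0.12, 0.15, 0.45])   # R_hat(h) on the training set
for lam in [0.0, 5.0, 50.0]:
    q = gibbs_posterior(prior, emp_risks, lam)
    print(lam, q.round(3), round(pac_bayes_bound(q, prior, emp_risks, m=200), 3))
```

Sweeping $\lambda$ illustrates the trade-off: a larger $\lambda$ concentrates $Q$ on low-training-risk hypotheses, which lowers $\mathbb{E}_{h \sim Q}[\hat{R}(h)]$ but pushes $Q$ away from $P$, increasing $KL(Q||P)$ and hence the slack term.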
## Contrast with Rademacher Complexity

* Rademacher complexity bounds the worst-case error over all hypotheses in $\mathcal{H}$.
* PAC-Bayes bounds the average error over a distribution of hypotheses.

## Benefits of PAC-Bayes

* Can handle complex hypothesis spaces
* Can incorporate prior knowledge
* Can be tighter than Rademacher complexity bounds
* Allows us to reason about uncertainty in our predictions
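Returning to the contrast above, here is a small Monte Carlo sketch (again my own illustration, not from the lecture) of the empirical Rademacher complexity from the definition earlier, for a finite hypothesis class represented by its predictions on a fixed sample. The sup over $\mathcal{H}$ becomes a max over rows, which is exactly the worst-case-over-the-class quantity that PAC-Bayes replaces by an average under $Q$. The function name and the toy data are assumptions made for illustration.

```python
import numpy as np

def empirical_rademacher(predictions, n_trials=2000, seed=0):
    """Monte Carlo estimate of E_sigma[ sup_h (1/m) sum_i sigma_i h(z_i) ].

    predictions: array of shape (|H|, m) with entries in {-1, +1},
    where row j holds h_j(z_1), ..., h_j(z_m) on a fixed sample S.
    """
    rng = np.random.default_rng(seed)
    num_h, m = predictions.shape
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1, 1], size=m)           # Rademacher variables
        total += np.max(predictions @ sigma) / m      # sup over the finite class
    return total / n_trials

# Toy example: 8 random sign classifiers evaluated on m = 50 sample points.
rng = np.random.default_rng(1)
preds = rng.choice([-1, 1], size=(8, 50))
print(round(empirical_rademacher(preds), 3))
```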