Session 4: From Supervised Learning to Optimization
Setting
We have $n$ labeled instances $(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})$, drawn independently and identically distributed (i.i.d.) from the joint distribution of the random pair $(X, Y)$.
Goal
Find a good predictor $\hat{y} = f(x)$ with $f: \mathcal{X} \to \mathcal{Y}$, where:
- $\mathcal{X}$ is the input (feature) space
- $\mathcal{Y}$ is the label space
Framing our Goal as an Optimization Problem
Step 1: Define a Loss Function
To measure prediction quality, we define a loss function:
\[\text{loss}: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}\]
Examples: Prediction error, squared loss, cross-entropy loss
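As a small illustration (not part of the original notes), two of these losses can be written as simple NumPy functions; the function names are just for this sketch:

```python
import numpy as np

def squared_loss(y_hat, y):
    # loss(y_hat, y) = (y_hat - y)^2
    return (y_hat - y) ** 2

def cross_entropy_loss(p_hat, y, eps=1e-12):
    # Binary cross-entropy; p_hat is the predicted probability of class 1.
    p_hat = np.clip(p_hat, eps, 1 - eps)  # avoid log(0)
    return -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
```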
Step 2: Define the Risk (Expected Loss)
We want to minimize the risk, which is the expected value of the loss function:
\[R[f] = \mathbb{E}\left[\text{loss}(f(X), Y)\right]\]
Step 3: Empirical Risk (Practical Approximation)
Since we don’t have access to the true distribution $(X, Y)$, we only have samples. Therefore, we define the empirical risk:
\[R_S[f] = \frac{1}{n}\sum_{i=1}^{n} \text{loss}(f(x^{(i)}), y^{(i)})\]
Step 4: Empirical Risk Minimization (ERM)
The optimization problem becomes:
\[\min_{f \in \mathcal{F}} R_S[f]\]where $\mathcal{F}$ is a hypothesis class (class of functions).
Examples of $\mathcal{F}$:
- Linear models: $f(x) = w^T x + b$
- Neural networks
- Decision trees
- Kernel methods
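For concreteness, here is a minimal sketch (my own illustration, not from the notes) of evaluating the empirical risk of one linear predictor under the squared loss; the data and parameter values are made up:

```python
import numpy as np

# Toy data: n = 4 examples with 2 features each (made-up values).
X = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 1.5]])
y = np.array([2.0, 0.0, 3.0, -1.0])

# A candidate linear predictor f(x) = w^T x + b from the hypothesis class F.
w = np.array([0.8, -0.3])
b = 0.1

predictions = X @ w + b                            # f(x^{(i)}) for each example
empirical_risk = np.mean((predictions - y) ** 2)   # R_S[f] with squared loss
print(empirical_risk)
```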
Generalization Gap
The relationship between true risk and empirical risk is:
\[R[f] = R_S[f] + \underbrace{R[f] - R_S[f]}_{\text{generalization gap}}\]The generalization gap measures how well the model performs on unseen data compared to the training data. A small generalization gap indicates good generalization.
Three Interconnected Problems
When solving machine learning problems, we face three interconnected challenges:
- Representation: How do we choose the hypothesis class $\mathcal{F}$?
- Optimization: How can we efficiently solve the ERM problem?
- Generalization: Will the prediction performance transfer from training examples to unseen instances?
In this session, we focus on the optimization aspect.
Optimization Basics
The general optimization problem is:
\[\min_{w \in \mathbb{R}^d} \Phi(w)\]where $\Phi: \mathbb{R}^d \to \mathbb{R}$ is the objective function.
In the context of ERM, if we parameterize our hypothesis class $\mathcal{F}$ by weights $w$, then:
\[\Phi(w) = R_S[f_w] = \frac{1}{n}\sum_{i=1}^{n} \text{loss}(f_w(x^{(i)}), y^{(i)})\]
Example: Binary Logistic Regression
We now derive the gradient for binary logistic regression with 2 features as a concrete example of how optimization connects to machine learning.
Derivation of the Gradient for Binary Logistic Regression
1. Model Specification
For a single training example $(\mathbf{x}, y)$ where
\[\mathbf{x} = \begin{bmatrix}x_{1}\\ x_{2}\end{bmatrix}, \qquad y \in \{0,1\},\]the logistic (sigmoid) function gives the predicted probability of the positive class:
\[\hat{p} = P(y=1 \mid \mathbf{x}) = \sigma(z), \qquad z = w_{0} + w_{1}x_{1} + w_{2}x_{2}\] \[\sigma(z) = \frac{1}{1 + e^{-z}}\]Here, $w_{0}$ is the intercept (bias), and $w_{1}, w_{2}$ are the weights for the two features.
The parameter vector is $\mathbf{w} = [w_0, w_1, w_2]^T \in \mathbb{R}^3$.
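A minimal sketch of this model in NumPy (the function names are my own, chosen for this example):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, x):
    # w = [w0, w1, w2]; x = [x1, x2]
    # z = w0 + w1*x1 + w2*x2, p_hat = sigma(z)
    z = w[0] + w[1] * x[0] + w[2] * x[1]
    return sigmoid(z)
```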
2. Loss Function (Binary Cross-Entropy)
For one observation, the negative log-likelihood (binary cross-entropy) is
\[\ell(\mathbf{w}) = -\left[ y\log(\hat{p}) + (1-y)\log(1-\hat{p})\right]\]Substituting $\hat{p} = \sigma(z)$ gives
\[\ell(\mathbf{w}) = -\left[ y\log\sigma(z) + (1-y)\log(1-\sigma(z))\right]\]This is our objective function $\Phi(\mathbf{w})$ for a single example.
3. Useful Derivative of the Sigmoid
\[\frac{d\sigma(z)}{dz} = \sigma(z)(1 - \sigma(z))\]
4. Derivative of the Loss w.r.t. the Linear Predictor $z$
We compute:
\[\frac{\partial \ell}{\partial z} = -\left[ y\frac{1}{\sigma(z)}\frac{d\sigma(z)}{dz} + (1-y)\frac{1}{1-\sigma(z)}\left(-\frac{d\sigma(z)}{dz}\right) \right]\] \[= -\left[ y(1-\sigma(z)) - (1-y)\sigma(z) \right]\] \[= \sigma(z) - y = \hat{p} - y\]Therefore:
\[\boxed{\frac{\partial \ell}{\partial z} = \hat{p} - y}\]
5. Gradient w.r.t. Each Weight
Because $z = w_{0} + w_{1}x_{1} + w_{2}x_{2}$, we have:
\[\frac{\partial z}{\partial w_{0}} = 1, \qquad \frac{\partial z}{\partial w_{1}} = x_{1}, \qquad \frac{\partial z}{\partial w_{2}} = x_{2}\]Applying the chain rule $\displaystyle\frac{\partial \ell}{\partial w_j} = \frac{\partial \ell}{\partial z}\frac{\partial z}{\partial w_j}$:
| Parameter | Gradient |
|---|---|
| $w_{0}$ (bias) | $\displaystyle \frac{\partial \ell}{\partial w_{0}} = \hat{p} - y$ |
| $w_{1}$ | $\displaystyle \frac{\partial \ell}{\partial w_{1}} = (\hat{p} - y)\,x_{1}$ |
| $w_{2}$ | $\displaystyle \frac{\partial \ell}{\partial w_{2}} = (\hat{p} - y)\,x_{2}$ |
In vector form:
\[\nabla_{\mathbf{w}} \ell(\mathbf{w}) = (\hat{p} - y) \begin{bmatrix} 1\\ x_{1}\\ x_{2} \end{bmatrix}\]
6. Gradient for the Entire Dataset
For $n$ i.i.d. training examples $\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$, the objective function (average loss) is:
\[\Phi(\mathbf{w}) = L(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n}\ell^{(i)}(\mathbf{w})\]Its gradient is:
\[\boxed{\nabla_{\mathbf{w}} \Phi(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n} (\hat{p}^{(i)} - y^{(i)}) \begin{bmatrix} 1\\ x^{(i)}_{1}\\ x^{(i)}_{2} \end{bmatrix}}\]Or component-wise:
\[\frac{\partial \Phi}{\partial w_{j}} = \frac{1}{n}\sum_{i=1}^{n} (\hat{p}^{(i)} - y^{(i)}) \, x^{(i)}_{j}\]where $x^{(i)}_{0} = 1$ (for the intercept) and $j \in \{0, 1, 2\}$.
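To make the formula concrete, here is a small NumPy sketch of the full-dataset gradient (my own illustration; the function and argument names are only for this example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient(w, X, y):
    """Average gradient of the binary cross-entropy over the dataset.

    w : array of shape (3,)   -> [w0, w1, w2]
    X : array of shape (n, 2) -> two features per example
    y : array of shape (n,)   -> labels in {0, 1}
    """
    n = X.shape[0]
    X_aug = np.hstack([np.ones((n, 1)), X])   # prepend x_0 = 1 for the intercept
    p_hat = sigmoid(X_aug @ w)                # predicted probabilities
    return X_aug.T @ (p_hat - y) / n          # (1/n) * sum (p_hat - y) * x^(i)
```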
7. Summary of the Derivation
- Model: $z = w_{0} + w_{1}x_{1} + w_{2}x_{2}$, $\hat{p} = \sigma(z)$
- Loss: Binary cross-entropy $\ell = -[y\log\hat{p} + (1-y)\log(1-\hat{p})]$
- Key result: $\partial\ell/\partial z = \hat{p} - y$
- Chain rule: Apply to each weight using $\partial z/\partial w_j$
- Final gradient: $\nabla_{\mathbf{w}}\ell = (\hat{p} - y)[1,\,x_{1},\,x_{2}]^{\top}$
These gradients are used in gradient descent (or more advanced optimizers) to iteratively update and learn the optimal parameters $w_{0}, w_{1}, w_{2}$ that minimize the empirical risk.
Gradient Descent Algorithm
Now that we can compute the gradient $\nabla_{\mathbf{w}} \Phi(\mathbf{w})$, we can use it to minimize the objective function. Gradient descent is an iterative optimization algorithm that moves in the direction of steepest descent.
The Intuition
The gradient $\nabla_{\mathbf{w}} \Phi(\mathbf{w})$ points in the direction of steepest ascent of the function $\Phi$. Therefore, to minimize $\Phi$, we should move in the opposite direction of the gradient.
The Update Rule
Starting from an initial guess $\mathbf{w}^{(0)}$, we iteratively update the parameters:
\[\boxed{\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \alpha_t \nabla_{\mathbf{w}} \Phi(\mathbf{w}^{(t)})}\]where:
- $t$ is the iteration number
- $\alpha_t > 0$ is the step size (or learning rate) at iteration $t$
- $\nabla_{\mathbf{w}} \Phi(\mathbf{w}^{(t)})$ is the gradient evaluated at $\mathbf{w}^{(t)}$
Algorithm: Batch Gradient Descent
Input: Training data $\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$, step size sequence $\{\alpha_t\}_{t=0}^{T-1}$, number of iterations $T$
Initialize: $\mathbf{w}^{(0)}$ (e.g., randomly or to zeros)
For $t = 0, 1, 2, \ldots, T-1$:
1. Compute the gradient over the entire dataset:
\[\nabla_{\mathbf{w}} \Phi(\mathbf{w}^{(t)}) = \frac{1}{n}\sum_{i=1}^{n} \nabla_{\mathbf{w}} \ell^{(i)}(\mathbf{w}^{(t)})\]
2. Update the parameters:
\[\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \alpha_t \nabla_{\mathbf{w}} \Phi(\mathbf{w}^{(t)})\]
Output: $\mathbf{w}^{(T)}$
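A compact sketch of this loop for the logistic-regression objective derived above (again my own illustration; the constant step size and iteration count are arbitrary choices):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, T=1000):
    # X: (n, 2) feature matrix, y: (n,) labels in {0, 1}
    n = X.shape[0]
    X_aug = np.hstack([np.ones((n, 1)), X])          # prepend x_0 = 1 for the bias
    w = np.zeros(X_aug.shape[1])                     # initialize w^{(0)} to zeros
    for t in range(T):
        p_hat = 1.0 / (1.0 + np.exp(-X_aug @ w))     # predicted probabilities
        grad = X_aug.T @ (p_hat - y) / n             # gradient over the entire dataset
        w = w - alpha * grad                         # w^{(t+1)} = w^{(t)} - alpha * grad
    return w
```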
Example: Gradient Descent for Logistic Regression
For binary logistic regression with 2 features, the update becomes:
\[\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \frac{\alpha_t}{n} \sum_{i=1}^{n} (\hat{p}^{(i)} - y^{(i)}) \begin{bmatrix} 1\\ x^{(i)}_{1}\\ x^{(i)}_{2} \end{bmatrix}\]Or component-wise:
\[w_0^{(t+1)} = w_0^{(t)} - \frac{\alpha_t}{n} \sum_{i=1}^{n} (\hat{p}^{(i)} - y^{(i)})\] \[w_1^{(t+1)} = w_1^{(t)} - \frac{\alpha_t}{n} \sum_{i=1}^{n} (\hat{p}^{(i)} - y^{(i)}) x^{(i)}_{1}\] \[w_2^{(t+1)} = w_2^{(t)} - \frac{\alpha_t}{n} \sum_{i=1}^{n} (\hat{p}^{(i)} - y^{(i)}) x^{(i)}_{2}\]
Choosing the Step Size
The step size $\alpha_t$ is a crucial hyperparameter that can vary with iteration $t$:
- Too small: Convergence is very slow, requiring many iterations
- Too large: The algorithm may overshoot the minimum and diverge
- Just right: Fast convergence to the minimum
Common strategies (a short code sketch of several of these schedules follows the list):
1. Constant step size: $\alpha_t = \alpha$ for all $t$
- Simple but may not be optimal
2. Diminishing step size: $\alpha_t$ decreases over time
- Example: $\alpha_t = \frac{\alpha_0}{1 + t}$ or $\alpha_t = \frac{\alpha_0}{\sqrt{t+1}}$
- Ensures convergence for convex problems under certain conditions
3. Exponential decay: $\alpha_t = \alpha_0 \cdot \gamma^t$ where $0 < \gamma < 1$
- Common in practice (e.g., $\gamma = 0.95$ or $0.99$)
4. Step decay: Reduce $\alpha_t$ by a factor every few epochs
- Example: $\alpha_t = \alpha_0 \cdot 0.5^{\lfloor t/k \rfloor}$ where $k$ is the decay interval
5. Line search: Choose $\alpha_t$ at each iteration to minimize $\Phi(\mathbf{w}^{(t)} - \alpha_t \nabla_{\mathbf{w}} \Phi(\mathbf{w}^{(t)}))$
- More computationally expensive but can be more efficient
6. Adaptive methods: Adjust $\alpha_t$ based on gradient history
- Examples: AdaGrad, RMSprop, Adam
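As a rough illustration (my own sketch, not from the notes), the first four schedules can be written as a single helper; the default values are arbitrary:

```python
def step_size(t, alpha0=0.1, schedule="constant", gamma=0.95, k=10):
    # Returns alpha_t for iteration t under a few of the schedules above.
    if schedule == "constant":
        return alpha0
    if schedule == "diminishing":
        return alpha0 / (1 + t)
    if schedule == "exponential":
        return alpha0 * gamma ** t
    if schedule == "step":
        return alpha0 * 0.5 ** (t // k)   # halve every k iterations
    raise ValueError(f"unknown schedule: {schedule}")
```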
Stopping Criteria
Gradient descent continues until one of the following conditions is met:
- Maximum iterations: $t \geq T$
- Gradient magnitude: $\lVert\nabla_{\mathbf{w}} \Phi(\mathbf{w}^{(t)})\rVert < \epsilon$ for some small threshold $\epsilon > 0$
- Function value change: $\lvert\Phi(\mathbf{w}^{(t+1)}) - \Phi(\mathbf{w}^{(t)})\rvert < \epsilon$
- Parameter change: $\lVert\mathbf{w}^{(t+1)} - \mathbf{w}^{(t)}\rVert < \epsilon$
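These checks can be combined in code; a rough sketch (the threshold and argument names are my own, chosen for this example):

```python
import numpy as np

def should_stop(t, T, grad, w_new, w_old, phi_new, phi_old, eps=1e-6):
    # True if any of the stopping criteria above is met.
    return (
        t >= T
        or np.linalg.norm(grad) < eps
        or abs(phi_new - phi_old) < eps
        or np.linalg.norm(w_new - w_old) < eps
    )
```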
Variants of Gradient Descent
1. Batch Gradient Descent (Standard GD)
Uses all $n$ training examples to compute the gradient at each iteration.
\[\nabla_{\mathbf{w}} \Phi(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n} \nabla_{\mathbf{w}} \ell^{(i)}(\mathbf{w})\]
- Pros: Stable convergence, accurate gradient
- Cons: Slow for large datasets (requires full pass through data)
2. Stochastic Gradient Descent (SGD)
Uses one randomly selected training example at each iteration.
Update rule:
\[\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \alpha_t \nabla_{\mathbf{w}} \ell^{(i_t)}(\mathbf{w}^{(t)})\]where $i_t$ is randomly sampled from $\{1, \ldots, n\}$.
- Pros: Much faster per iteration, can escape local minima
- Cons: Noisy updates, requires careful step size tuning
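A minimal SGD sketch for the same logistic-regression loss (illustrative only; the diminishing step size and iteration budget are arbitrary choices):

```python
import numpy as np

def sgd_logistic(X, y, alpha0=0.5, T=10000, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    X_aug = np.hstack([np.ones((n, 1)), X])   # prepend bias feature x_0 = 1
    w = np.zeros(X_aug.shape[1])
    for t in range(T):
        i = rng.integers(n)                   # sample i_t uniformly from {1, ..., n}
        p_hat = 1.0 / (1.0 + np.exp(-X_aug[i] @ w))
        grad_i = (p_hat - y[i]) * X_aug[i]    # gradient of the single-example loss
        w = w - (alpha0 / (1 + t)) * grad_i   # diminishing step size
    return w
```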
3. Mini-Batch Gradient Descent
Uses a subset (mini-batch) of $b$ examples at each iteration.
Update rule:
\[\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \frac{\alpha_t}{b} \sum_{i \in \mathcal{B}_t} \nabla_{\mathbf{w}} \ell^{(i)}(\mathbf{w}^{(t)})\]where $\mathcal{B}_t$ is a mini-batch of size $b$ (typically 32, 64, 128, or 256).
- Pros: Balance between computational efficiency and gradient accuracy, can leverage parallel computation
- Cons: One more hyperparameter (batch size)
Mini-batch GD is the most commonly used variant in practice, especially for training neural networks.
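A mini-batch version in the same style (again an illustrative sketch; the batch size, step size, and epoch count are arbitrary):

```python
import numpy as np

def minibatch_gd_logistic(X, y, alpha=0.1, b=32, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    X_aug = np.hstack([np.ones((n, 1)), X])   # prepend bias feature x_0 = 1
    w = np.zeros(X_aug.shape[1])
    for _ in range(epochs):
        perm = rng.permutation(n)             # shuffle once per epoch
        for start in range(0, n, b):
            batch = perm[start:start + b]     # mini-batch B_t of (up to) b indices
            p_hat = 1.0 / (1.0 + np.exp(-X_aug[batch] @ w))
            grad = X_aug[batch].T @ (p_hat - y[batch]) / len(batch)
            w = w - alpha * grad
    return w
```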
Convergence Properties
For convex functions $\Phi$ (like logistic regression):
- Gradient descent with appropriate step size sequence is guaranteed to converge to the global minimum
- Convergence rate: $O(1/T)$ for smooth functions with proper step size choice
- Typical requirement: $\sum_{t=0}^{\infty} \alpha_t = \infty$ and $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$ (e.g., $\alpha_t = \frac{1}{t+1}$)
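For instance, the diminishing schedule $\alpha_t = \frac{\alpha_0}{t+1}$ satisfies both conditions, since the harmonic series diverges while the series of squared terms converges:
\[\sum_{t=0}^{\infty} \frac{\alpha_0}{t+1} = \infty, \qquad \sum_{t=0}^{\infty} \frac{\alpha_0^2}{(t+1)^2} = \alpha_0^2 \cdot \frac{\pi^2}{6} < \infty\]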
For non-convex functions (like neural networks):
- Gradient descent is only guaranteed to converge to a stationary point (a local minimum or a saddle point), not the global minimum
- In practice, local minima found by GD often work well for machine learning tasks