class: center, middle, title-slide
count: false

# Optimization for deep learning
.bold[Marc Lelarge]

---

# (1) Optimization and deep learning

## Gradient descent, stochastic gradient descent and mini-batch SGD

# (2) Gradient descent optimization algorithms

## Momentum, Nesterov accelerated gradient, Adagrad, RMSProp, Adam, AMSGrad

# (3) PyTorch optimizers

---

# (1) Some warnings about optimization in deep learning

The objective function of an optimization algorithm is usually a loss function based on the training data set, hence the goal of optimization is to reduce the _training error_.

--
count: false

However, the goal of (deep) learning is to reduce the _generalization error_.

--
count: false

In order to reduce the generalization error, we need to pay attention to _overfitting_ in addition to using the optimization algorithm to reduce the training error.

--
count: false

In this course, we focus specifically on the _performance_ of the optimization algorithm in minimizing the objective function, rather than on the model's generalization error. In the next course, we will see techniques to avoid _overfitting_.

--
count: false

No theorems in this lecture (for theorems, see a convex optimization course). We focus on intuition, because deep learning optimization problems are not convex, and we hope the _intuition_ built on convex problems will still be useful!

---

# (1) What do we optimize?

Recall simple linear regression, where the goal is to minimize the .bold[cost function]:
$$
J(\theta) = \frac{1}{2}\sum\_{i=1}^m(y(i)-\theta^T x(i))^2.
$$

--
count: false

For a deep neural network $F(.;\theta)$ with parameters $\theta$, we minimize the objective function:
$$
J(\theta) = \frac{1}{2}\sum\_{i=1}^m(y(i)-F(x(i);\theta))^2,
$$
in the parameter $\theta$ on the _training set_.

This problem is not convex anymore, but we can still compute the gradient $\nabla J(\theta)$ of the cost function with respect to the parameters (well, PyTorch does it for us!).

---

# (1) Gradient descent variants

Idea: update the parameters in the opposite direction of the gradient. The _learning rate_ $\eta$ determines the size of the steps.

--
count: false

## Batch gradient descent

$$
\theta\_{t+1} = \theta\_t - \eta \nabla J(\theta\_t)
$$

--
count: false

## Stochastic gradient descent

$$
\theta\_{t+1} = \theta\_t - \eta \nabla J(\theta\_t;x(i),y(i))
$$

--
count: false

## Mini-batch gradient descent

$$
\theta\_{t+1} = \theta\_t - \eta \nabla J(\theta\_t;x(i:i+b),y(i:i+b)),
$$
where $b$ is the batch size.

---

# (1) Challenges

Mini-batch gradient descent is the algorithm of choice when training a neural network. The term SGD is commonly employed even when mini-batches are used!

--
count: false

- Choosing a proper learning rate can be difficult. How to adapt the learning rate during training?
- Why apply the same learning rate to all parameter updates?
- How to escape saddle points, where the gradient is close to zero in all dimensions?

--
count: false

In the rest of the lecture, we introduce _modifications to SGD_. Although we will apply these modifications to mini-batch gradient descent, we omit the explicit reference to the batch $x(i:i+b),y(i:i+b)$. Hence we write for SGD:
$$
\theta\_{t+1} = \theta\_t - \eta \nabla J(\theta\_t),
$$
but we need to keep in mind that updates are performed for every mini-batch of $b$ training samples.

---

# (2) Gradient descent optimization algorithms

A nice [survey](http://ruder.io/optimizing-gradient-descent/) by Sebastian Ruder.

Please run the [jupyter notebook](https://github.com/mlelarge/dataflowr/blob/master/Gradient_descent_optimization_algorithms_empty.ipynb) in parallel.

We will deal with the toy problem: minimize in $x_1,x_2$ the function
$$
f(x\_1, x\_2) = 0.1 x\_1^2+2 x\_2^2.
$$

--
count: false

You will need to implement all variations of SGD. Have a look at the `train_2d()` function and the Gradient descent code to understand what this function does.
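To fix ideas, here is a minimal sketch of plain gradient descent on the toy objective; the names `f_2d` and `gd_2d` and the hard-coded gradient are illustrative assumptions, not the notebook's actual code.

```
eta = 0.4  # learning rate; an assumed value, try others in the notebook

def f_2d(x1, x2):
    # the toy objective f(x1, x2) = 0.1 * x1^2 + 2 * x2^2
    return 0.1 * x1 ** 2 + 2 * x2 ** 2

def gd_2d(x1, x2):
    # the gradient of f is (0.2 * x1, 4 * x2): one step of plain gradient descent
    return x1 - eta * 0.2 * x1, x2 - eta * 4 * x2

x1, x2 = -5.0, -2.0  # starting point
for t in range(20):
    x1, x2 = gd_2d(x1, x2)
print(f"after 20 steps: x1 = {x1:.6f}, x2 = {x2:.6f}, f = {f_2d(x1, x2):.6f}")
```

With this learning rate, $x\_2$ overshoots and oscillates while $x\_1$ converges slowly: exactly the behaviour that momentum, introduced next, is designed to dampen.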
---

# (2) Momentum

Accelerating SGD by dampening oscillations, i.e. by averaging the last values of the gradient.

--
count: false

$$\begin{aligned}
v\_{t+1} &= \gamma v\_{t}+ \eta \nabla J(\theta\_t)\\\\
\theta\_{t+1} &= \theta\_t - v\_{t+1}
\end{aligned}$$

--
count: false

Why does it work? With $g\_t = \nabla J(\theta\_t)$, unrolling the recursion gives, for any $k\geq 0$:
$$
v\_{t+1} = \gamma^{k+1} v\_{t-k} +\eta \underbrace{\sum\_{i=0}^k \gamma^i g\_{t-i}}_{\text{weighted average of the last gradients}}
$$

Typical value: $\gamma = 0.9$.

---

# (2) Nesterov accelerated gradient

With momentum, we compute the gradient at the current position and then take a step that combines the momentum and this gradient. Nesterov proposed to first take the step following the momentum, and only then to adjust it with the gradient computed at the look-ahead position:
$$\begin{aligned}
v\_{t+1} &= \gamma v\_{t}+ \eta \nabla J(\theta\_t-\gamma v\_{t})\\\\
\theta\_{t+1} &= \theta\_t - v\_{t+1}
\end{aligned}$$

.citation[Nesterov, A method for solving the convex programming problem with convergence rate O(1/k^2), Dokl. Akad. Nauk SSSR 1983]

---

# (2) Adagrad

We would like to adapt our updates to each individual parameter, i.e. have a different decreasing learning rate for each parameter:
$$\begin{aligned}
s\_{t+1,i} &= s\_{t,i} + \nabla J(\theta\_t)\_i^2\\\\
\theta\_{t+1,i} &= \theta\_{t,i} - \frac{\eta}{\sqrt{s\_{t+1,i}+\epsilon}}\nabla J(\theta\_t)\_i
\end{aligned}$$

--
count: false

No manual tuning of the learning rate. Typical default values: $\eta=0.01$ and $\epsilon = 10^{-8}$.

.citation[Duchi et al., [Adaptive Subgradient Methods for Online Learning and Stochastic Optimization](http://jmlr.org/papers/v12/duchi11a.html), JMLR 2011]

---

# (2) RMSProp

Problem with Adagrad: since $s\_t$ accumulates all past squared gradients and never forgets, the learning rate goes to zero.

--
count: false

Idea proposed by G. Hinton in his Coursera class: use an exponential moving average instead.
$$\begin{aligned}
s\_{t+1,i} &= \gamma s\_{t,i} + (1-\gamma) \nabla J(\theta\_t)\_i^2\\\\
\theta\_{t+1,i} &= \theta\_{t,i} - \frac{\eta}{\sqrt{s\_{t+1,i}+\epsilon}}\nabla J(\theta\_t)\_i
\end{aligned}$$

--
count: false

With a slight abuse of notation (all operations are componentwise), we re-write the update as follows:
$$\begin{aligned}
s\_{t+1} &= \gamma s\_{t} + (1-\gamma) \nabla J(\theta\_t)^2\\\\
\theta\_{t+1} &= \theta\_{t} - \frac{\eta}{\sqrt{s\_{t+1}+\epsilon}}\nabla J(\theta\_t)
\end{aligned}$$

Typical values: $\gamma = 0.9$ and $\eta = 0.001$.

.citation[Hinton, [Coursera lecture 6](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf)]

---

# (2) Adam

Mixing the ideas of RMSProp and momentum, we get Adam = Adaptive Moment Estimation.
$$\begin{aligned}
m\_{t+1} &= \beta\_1 m\_t + (1-\beta\_1) \nabla J(\theta\_t)\\\\
v\_{t+1} &= \beta\_2 v\_t + (1-\beta\_2) \nabla J(\theta\_t)^2\\\\
\hat{m}\_{t+1} &= \frac{m\_{t+1}}{1-\beta\_1^{t+1}}\\\\
\hat{v}\_{t+1} &= \frac{v\_{t+1}}{1-\beta\_2^{t+1}}\\\\
\theta\_{t+1} &= \theta\_{t} - \frac{\eta}{\sqrt{\hat{v}\_{t+1}}+\epsilon} \hat{m}\_{t+1}
\end{aligned}$$

$\hat{m}\_t$ and $\hat{v}\_t$ are estimates of the first and second moments of the gradients. Because $m\_0=v\_0=0$, these estimates are biased towards $0$; the factors $(1-\beta\_1^{t+1})^{-1}$ and $(1-\beta\_2^{t+1})^{-1}$ counteract these biases.

--
count: false

Typical values: $\beta\_1=0.9$, $\beta\_2 =0.999$ and $\epsilon=10^{-8}$.
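Before implementing it in the notebook, here is a minimal Python sketch of this update on the toy problem $f(x\_1, x\_2) = 0.1 x\_1^2+2 x\_2^2$; the name `adam_2d` and the explicit per-coordinate state are illustrative assumptions, not PyTorch's implementation.

```
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

def adam_2d(x1, x2, m1, m2, v1, v2, t):
    g1, g2 = 0.2 * x1, 4 * x2                # gradient of the toy objective
    m1 = beta1 * m1 + (1 - beta1) * g1       # first-moment estimates
    m2 = beta1 * m2 + (1 - beta1) * g2
    v1 = beta2 * v1 + (1 - beta2) * g1 ** 2  # second-moment estimates
    v2 = beta2 * v2 + (1 - beta2) * g2 ** 2
    c1 = 1 - beta1 ** (t + 1)                # bias corrections
    c2 = 1 - beta2 ** (t + 1)
    x1 -= eta / ((v1 / c2) ** 0.5 + eps) * (m1 / c1)
    x2 -= eta / ((v2 / c2) ** 0.5 + eps) * (m2 / c1)
    return x1, x2, m1, m2, v1, v2

x1, x2, m1, m2, v1, v2 = -5.0, -2.0, 0.0, 0.0, 0.0, 0.0
for t in range(100):
    x1, x2, m1, m2, v1, v2 = adam_2d(x1, x2, m1, m2, v1, v2, t)
print(f"after 100 steps: x1 = {x1:.6f}, x2 = {x2:.6f}")
```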
.citation[Kingma and Ba, [Adam: a Method for Stochastic Optimization](https://arxiv.org/abs/1412.6980), ICLR 2015]

---

# (2) AMSGrad

Sometimes Adam forgets too fast; to fix this, we replace the moving average of the second moment by a running $\max$:
$$\begin{aligned}
m\_{t+1} &= \beta\_1 m\_t + (1-\beta\_1) \nabla J(\theta\_t)\\\\
v\_{t+1} &= \beta\_2 v\_t + (1-\beta\_2) \nabla J(\theta\_t)^2\\\\
\hat{m}\_{t+1} &= \frac{m\_{t+1}}{1-\beta\_1^{t+1}}\\\\
\hat{v}\_{t+1} &= \max\left(\hat{v}\_{t}, \frac{v\_{t+1}}{1-\beta\_2^{t+1}}\right)\\\\
\theta\_{t+1} &= \theta\_{t} - \frac{\eta}{\sqrt{\hat{v}\_{t+1}}+\epsilon} \hat{m}\_{t+1}
\end{aligned}$$

.citation[Reddi et al., [On the Convergence of Adam and Beyond](https://openreview.net/forum?id=ryQu7f-RZ), ICLR 2018]

---

# (3) PyTorch optimizers

All located in `torch.optim.*`. All have a similar constructor, `torch.optim.*(params, lr=..., momentum=...)`. Default values differ from one optimizer to another, check the doc.

`params` should be a list of all the tensors to optimize over; it can be obtained from any module with `module.parameters()`.

--
count: false

## Learning rate schedule

Once the loss is no longer decreasing, reducing the learning rate can help. A typical approach in vision: divide the learning rate by D every T epochs.

```
model = ...
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# divide the learning rate by 10 every 20 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)
for epoch in range(epochs):
    ...  # one epoch of training, calling optimizer.step() on each mini-batch
    scheduler.step()
```

---

class: end-slide, center
count: false

The end.