class: center, middle # Lecture 4: ## Going deeper .grid[ .kol-6-12[ Marc Lelarge
*
.red[Andrei Bursuc]
Alexandre Defossez ] .kol-6-12[ Alexandre Sablayrolles
Pierre Stock
Neil Zeghidour ] ] .red[*] contact in case of emergency... --- # Outline - Universal approximation theorem - Why going deeper? - Regularization in deep networks + classic regularization: $L\_2$ regularization + implicit regularization: Dropout, Batch Normalization - Residual networks --- class: center, middle # Going deeper --- # Universal function approximation .bold[Theorem.] ( Hornik et al, 1991) Let $\sigma$ be a nonconstant, bounded, and monotonically-increasing continuous function. For any $f \in C([0, 1]^{d})$ and $\varepsilon > 0$, there exists $h \in \mathbb{N}$ real constants $v\_i, b\_i \in \mathbb{R}$ and real vectors $w_i \in \mathbb{R}^d$ such that: $$ | \sum\_i^h v\_i \sigma(w\_i^Tx + b\_i) - f (x) | < \varepsilon $$ that is: neural nets are dense in $C([0, 1]^{d})$. - It guarantees that even a single hidden-layer network can represent any classification problem in which the boundary is locally linear (smooth); - It does not inform about good/bad architectures, nor how they relate to the optimization procedure. - The universal approximation theorem generalizes to any non-polynomial (possibly unbounded) activation function, including the ReLU (Leshno, 1993). .citation[K. Hornik et al., Approximation Capabilities of Multilayer Feedforward Networks, 1991] --- .bold[Theorem] (Barron, 1992) The mean integrated square error between the estimated network $\hat{F}$ and the target function $f$ is bounded by $$O\left(\frac{C^2\_f}{q} + \frac{qp}{N}\log N\right)$$ where $N$ is the number of training points, $q$ is the number of neurons, $p$ is the input dimension, and $C\_f$ measures the global smoothness of $f$. Provided enough data, it guarantees that adding more neurons will result in a better approximation. .credit[Slide credit: G. Louppe, INFO8004-1 Advanced Machine Learning] --- # Problem solved? UFA theorems **do not tell us**: - The number $h$ of hidden units is small enough to have the network fit in RAM. - The optimal function parameters can be found in finite time by minimizing the Empirical Risk with SGD and the usual random initialization schemes. --- # Approximation with ReLU nets .left-column[ ```python import numpy as np import matplotlib.pyplot as plt def relu(x): return np.maximum(x, 0) def rect(x, a, b, h, eps=1e-7): return h / eps * ( relu(x - a) - relu(x - (a + eps)) - relu(x - b) + relu(x - (b + eps))) x = np.linspace(-3, 3, 1000) *y = ( rect(x, -1, 0, 0.4)) plt.plot(x, y) ``` ] .right-column[
] .reset-columns[ ] .citation[Conner Davis, [Quora: Is a single layered ReLU network still a universal approximator?](https://www.quora.com/Is-a-single-layered-ReLu-network-still-a-universal-approximator)] --- # Approximation with ReLU nets .left-column[ ```python import numpy as np import matplotlib.pyplot as plt def relu(x): return np.maximum(x, 0) def rect(x, a, b, h, eps=1e-7): return h / eps * ( relu(x - a) - relu(x - (a + eps)) - relu(x - b) + relu(x - (b + eps))) x = np.linspace(-3, 3, 1000) *y = ( rect(x, -1, 0, 0.4) * + rect(x, 0, 1, 1.3) * + rect(x, 1, 2, 0.8)) plt.plot(x, y) ``` ] .right-column[
]
.reset-columns[ ]

.citation[Conner Davis, [Quora: Is a single layered ReLU network still a universal approximator?](https://www.quora.com/Is-a-single-layered-ReLu-network-still-a-universal-approximator)]

---
# Approximation with ReLU nets

.left-column[
```python
import numpy as np
import matplotlib.pyplot as plt

def relu(x):
    return np.maximum(x, 0)

def rect(x, a, b, h, eps=1e-7):
    return h / eps * (
        relu(x - a) - relu(x - (a + eps))
        - relu(x - b) + relu(x - (b + eps)))

*x = np.arange(0, 5, 0.5)  # 10 points
z = np.arange(0, 5, 0.001)

sin_approx = np.zeros_like(z)
*for i in range(2, x.size - 1):
*    sin_approx = sin_approx + rect(z, (x[i] + x[i-1]) / 2,
*                                   (x[i] + x[i+1]) / 2, np.sin(x[i]), 1e-7)

plt.plot(z, sin_approx)
```
]
.right-column[
]
.reset-columns[ ]

.citation[Conner Davis, [Quora: Is a single layered ReLU network still a universal approximator?](https://www.quora.com/Is-a-single-layered-ReLu-network-still-a-universal-approximator)]

---
# Approximation with ReLU nets

.left-column[
```python
import numpy as np
import matplotlib.pyplot as plt

def relu(x):
    return np.maximum(x, 0)

def rect(x, a, b, h, eps=1e-7):
    return h / eps * (
        relu(x - a) - relu(x - (a + eps))
        - relu(x - b) + relu(x - (b + eps)))

*x = np.arange(0, 5, 0.25)  # 20 points
z = np.arange(0, 5, 0.001)

sin_approx = np.zeros_like(z)
*for i in range(2, x.size - 1):
*    sin_approx = sin_approx + rect(z, (x[i] + x[i-1]) / 2,
*                                   (x[i] + x[i+1]) / 2, np.sin(x[i]), 1e-7)

plt.plot(z, sin_approx)
```
]
.right-column[
]
.reset-columns[ ]

.citation[Conner Davis, [Quora: Is a single layered ReLU network still a universal approximator?](https://www.quora.com/Is-a-single-layered-ReLu-network-still-a-universal-approximator)]

---
# Approximation with ReLU nets

.left-column[
```python
import numpy as np
import matplotlib.pyplot as plt

def relu(x):
    return np.maximum(x, 0)

def rect(x, a, b, h, eps=1e-7):
    return h / eps * (
        relu(x - a) - relu(x - (a + eps))
        - relu(x - b) + relu(x - (b + eps)))

*x = np.arange(0, 5, 0.1)  # 50 points
z = np.arange(0, 5, 0.001)

sin_approx = np.zeros_like(z)
*for i in range(2, x.size - 1):
*    sin_approx = sin_approx + rect(z, (x[i] + x[i-1]) / 2,
*                                   (x[i] + x[i+1]) / 2, np.sin(x[i]), 1e-7)

plt.plot(z, sin_approx)
```
]
.right-column[
]
.reset-columns[ ]

.citation[Conner Davis, [Quora: Is a single layered ReLU network still a universal approximator?](https://www.quora.com/Is-a-single-layered-ReLu-network-still-a-universal-approximator)]

---
# Approximation with ReLU nets

.left-column[
```python
import numpy as np
import matplotlib.pyplot as plt

def relu(x):
    return np.maximum(x, 0)

def rect(x, a, b, h, eps=1e-7):
    return h / eps * (
        relu(x - a) - relu(x - (a + eps))
        - relu(x - b) + relu(x - (b + eps)))

*x = np.arange(0, 5, 0.01)  # 500 points
z = np.arange(0, 5, 0.001)

sin_approx = np.zeros_like(z)
*for i in range(2, x.size - 1):
*    sin_approx = sin_approx + rect(z, (x[i] + x[i-1]) / 2,
*                                   (x[i] + x[i+1]) / 2, np.sin(x[i]), 1e-7)

plt.plot(z, sin_approx)
```
]
.right-column[
] .reset-columns[ ] .citation[Conner Davis, [Quora: Is a single layered ReLU network still a universal approximator?](https://www.quora.com/Is-a-single-layered-ReLu-network-still-a-universal-approximator)] --- # Approximation with ReLU nets Consider the 1-layer MLP $$f(x) = \sum w\_i \text{ReLU}(x + b_i).$$ This model can approximate any smooth 1D function with a linear combination of translated/scaled ReLU functions. $$f(x) = \sigma(w\_1 x + b_1)$$ .center[
] .credit[Figure credit: G. Louppe, INFO8004-1 Advanced Machine Learning] --- class: middle count: false Consider the 1-layer MLP $$f(x) = \sum w\_i \text{ReLU}(x + b_i).$$ This model can approximate any smooth 1D function with a linear combination of translated/scaled ReLU functions. $$f(x) = \sigma(w\_1 x + b_1) + \sigma(w\_2 x + b_2)$$ .center[
] .credit[Figure credit: G. Louppe, INFO8004-1 Advanced Machine Learning] --- class: middle count: false Consider the 1-layer MLP $$f(x) = \sum w\_i \text{ReLU}(x + b_i).$$ This model can approximate any smooth 1D function with a linear combination of translated/scaled ReLU functions. $$f(x) = \sigma(w\_1 x + b_1) + \sigma(w\_2 x + b_2) + \sigma(w\_3 x + b_3)$$ .center[
] .credit[Figure credit: G. Louppe, INFO8004-1 Advanced Machine Learning] --- class: middle count: false Consider the 1-layer MLP $$f(x) = \sum w\_i \text{ReLU}(x + b_i).$$ This model can approximate any smooth 1D function with a linear combination of translated/scaled ReLU functions. $$f(x) = \sigma(w\_1 x + b_1) + \sigma(w\_2 x + b_2) + \sigma(w\_3 x + b_3) + \dots$$ .center[
] .credit[Figure credit: G. Louppe, INFO8004-1 Advanced Machine Learning] --- class: middle count: false Consider the 1-layer MLP $$f(x) = \sum w\_i \text{ReLU}(x + b_i).$$ This model can approximate any smooth 1D function with a linear combination of translated/scaled ReLU functions. $$f(x) = \sigma(w\_1 x + b_1) + \sigma(w\_2 x + b_2) + \sigma(w\_3 x + b_3) + \dots$$ .center[
] .credit[Figure credit: G. Louppe, INFO8004-1 Advanced Machine Learning] --- class: middle count: false Consider the 1-layer MLP $$f(x) = \sum w\_i \text{ReLU}(x + b_i).$$ This model can approximate any smooth 1D function with a linear combination of translated/scaled ReLU functions. $$f(x) = \sigma(w\_1 x + b_1) + \sigma(w\_2 x + b_2) + \sigma(w\_3 x + b_3) + \dots$$ .center[
] .credit[Figure credit: G. Louppe, INFO8004-1 Advanced Machine Learning] --- class: middle count: false Consider the 1-layer MLP $$f(x) = \sum w\_i \text{ReLU}(x + b_i).$$ This model can approximate any smooth 1D function with a linear combination of translated/scaled ReLU functions. $$f(x) = \sigma(w\_1 x + b_1) + \sigma(w\_2 x + b_2) + \sigma(w\_3 x + b_3) + \dots$$ .center[
] .credit[Figure credit: G. Louppe, INFO8004-1 Advanced Machine Learning] --- class: middle count: false Consider the 1-layer MLP $$f(x) = \sum w\_i \text{ReLU}(x + b_i).$$ This model can approximate any smooth 1D function with a linear combination of translated/scaled ReLU functions. $$f(x) = \sigma(w\_1 x + b_1) + \sigma(w\_2 x + b_2) + \sigma(w\_3 x + b_3) + \dots$$ .center[
] .credit[Figure credit: G. Louppe, INFO8004-1 Advanced Machine Learning] --- class: middle count: false Consider the 1-layer MLP $$f(x) = \sum w\_i \text{ReLU}(x + b_i).$$ This model can approximate any smooth 1D function with a linear combination of translated/scaled ReLU functions. $$f(x) = \sigma(w\_1 x + b_1) + \sigma(w\_2 x + b_2) + \sigma(w\_3 x + b_3) + \dots$$ .center[
] .credit[Figure credit: G. Louppe, INFO8004-1 Advanced Machine Learning] --- class: middle count: false Consider the 1-layer MLP $$f(x) = \sum w\_i \text{ReLU}(x + b_i).$$ This model can approximate any smooth 1D function with a linear combination of translated/scaled ReLU functions. $$f(x) = \sigma(w\_1 x + b_1) + \sigma(w\_2 x + b_2) + \sigma(w\_3 x + b_3) + \dots$$ .center[
] .credit[Figure credit: G. Louppe, INFO8004-1 Advanced Machine Learning] --- class: middle count: false Consider the 1-layer MLP $$f(x) = \sum w\_i \text{ReLU}(x + b_i).$$ This model can approximate any smooth 1D function with a linear combination of translated/scaled ReLU functions. $$f(x) = \sigma(w\_1 x + b_1) + \sigma(w\_2 x + b_2) + \sigma(w\_3 x + b_3) + \dots$$ .center[
] .credit[Figure credit: G. Louppe, INFO8004-1 Advanced Machine Learning] --- class: middle count: false Consider the 1-layer MLP $$f(x) = \sum w\_i \text{ReLU}(x + b_i).$$ This model can approximate any smooth 1D function with a linear combination of translated/scaled ReLU functions. $$f(x) = \sigma(w\_1 x + b_1) + \sigma(w\_2 x + b_2) + \sigma(w\_3 x + b_3) + \dots$$ .center[
] .credit[Figure credit: G. Louppe, INFO8004-1 Advanced Machine Learning] --- class: middle count: false Consider the 1-layer MLP $$f(x) = \sum w\_i \text{ReLU}(x + b_i).$$ This model can approximate any smooth 1D function with a linear combination of translated/scaled ReLU functions. $$f(x) = \sigma(w\_1 x + b_1) + \sigma(w\_2 x + b_2) + \sigma(w\_3 x + b_3) + \dots$$ .center[
]

.credit[Figure credit: G. Louppe, INFO8004-1 Advanced Machine Learning]

---
# Universal approximation

Even if the MLP is able to represent the function, learning can fail for two different reasons:

- the optimization algorithm may not be able to find the value of the parameters that correspond to the desired function;
- the training algorithm might choose the wrong function as a result of overfitting.

---
# Universal approximation

.grid[
.kol-4-12[
.center[]
]
.kol-4-12[
.center[]
]
.kol-4-12[
.center[]
]
]

---
# Universal approximation

Adding more neurons

.left-column[
.center.width-90[]
]
.right-column[
.center.width-90[]
]

---
# Overparametrization may help optimization

## Folklore experiment

.left-column[
.center.width-60[]

_Step 1:_ Generate labeled data by feeding random input vectors into a depth-$2$ net with a hidden layer of size $n$.
]
.right-column[
.center.width-60[]

_Step 2:_ It is difficult to train a new network on this labeled data with the same number of hidden units.

.red[It is much easier to train a new net with a bigger hidden layer or an additional layer]
]

.citation[Livni et al.; 2014]

---
class: middle

# Benefits of depth

---
# The benefits of depth

.caption[GINN: Geometric Illustrations for Neural Networks (http://www.bayeswatch.com/2018/09/17/GINN/)]

.center.width-40[]

---
# Efficient Oscillations with Composition

.left-column[
```python
import numpy as np
import matplotlib.pyplot as plt

def relu(x):
    return np.maximum(x, 0)

def tri(x):
    return relu(
        relu(2 * x) - relu(4 * x - 2))

x = np.linspace(-.3, 1.3, 1000)
y = tri(x)
plt.plot(x, y)
```
]
.right-column[
] .citation[M. Telgarsky, [Benefits of depth in neural networks](https://www.youtube.com/watch?v=ssaXJqG9Dz4), COLT 2016 ] --- # Efficient Oscillations with Composition .left-column[ ```python import numpy as np import matplotlib.pyplot as plt def relu(x): return np.maximum(x, 0) def tri(x): return relu( relu(2 * x) - relu(4 * x - 2)) x = np.linspace(-.3, 1.3, 1000) *y = tri(tri(x)) plt.plot(x, y) ``` ] .right-column[
] .citation[M. Telgarsky, [Benefits of depth in neural networks](https://www.youtube.com/watch?v=ssaXJqG9Dz4), COLT 2016 ] --- # Efficient Oscillations with Composition .left-column[ ```python import numpy as np import matplotlib.pyplot as plt def relu(x): return np.maximum(x, 0) def tri(x): return relu( relu(2 * x) - relu(4 * x - 2)) x = np.linspace(-.3, 1.3, 1000) *y = tri(tri(tri(x))) plt.plot(x, y) ``` ] .right-column[
] .center[ 1 more layer → 2x more oscillations ] .citation[M. Telgarsky, [Benefits of depth in neural networks](https://www.youtube.com/watch?v=ssaXJqG9Dz4), COLT 2016 ] --- # Efficient Oscillations with Composition .left-column[ ```python import numpy as np import matplotlib.pyplot as plt def relu(x): return np.maximum(x, 0) def tri(x): return relu( relu(2 * x) - relu(4 * x - 2)) x = np.linspace(-.3, 1.3, 1000) *y = tri(tri(tri(tri(x)))) plt.plot(x, y) ``` ] .right-column[
]

.center[
1 more layer → 2x more oscillations
]

.citation[M. Telgarsky, [Benefits of depth in neural networks](https://www.youtube.com/watch?v=ssaXJqG9Dz4), COLT 2016]

---
# Efficient Oscillations with Composition

.left-column[
- Adding the parameters required for **one new layer** can **multiply by two the number of local oscillations** in the decision function of the model.

- This is to be contrasted with the approach of adding parameters **on the same layer** (as in the rectangle example), which can only contribute an **additive number of new local oscillations**.
]

.right-column[
] .citation[M. Telgarsky, [Benefits of depth in neural networks](https://www.youtube.com/watch?v=ssaXJqG9Dz4), COLT 2016 ] --- # Effect of depth .bold[Theorem] (Montúfar et al, 2014) A rectifier neural network with $p$ input units and $L$ hidden layers of width $q \geq p$ can compute functions that have $\Omega((\frac{q}{p})^{(L-1)p} q^p)$ linear regions. - That is, the number of linear regions of deep models grows **exponentially** in $L$ and polynomially in $q$. - Even for small values of $L$ and $q$, deep rectifier models are able to produce substantially more linear regions than shallow rectifier models.
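To get a feel for how fast this grows, a quick numerical sketch of the bound (the constant hidden in $\Omega$ is ignored, and the values of $p$, $q$ and $L$ are arbitrary choices for illustration):

```python
# Lower bound (up to a constant) on the number of linear regions: (q/p)^((L-1)p) * q^p
def montufar_bound(p, q, L):
    return (q / p) ** ((L - 1) * p) * q ** p

p, q = 2, 8
for L in [1, 2, 3, 4]:
    print(L, f"{montufar_bound(p, q, L):.1e}")
# L=1 -> 6.4e+01, L=2 -> 1.0e+03, L=3 -> 1.6e+04, L=4 -> 2.6e+05
```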
.center.width-80[]

---
# Depth and Parametric Cost

.bold[Theorem] (Telgarsky, 2016): There exist functions that can be approximated by a deep ReLU network with $\Theta(k^3)$ layers of $\Theta(1)$ units each, but cannot be approximated by shallower networks with $\Theta(k)$ layers unless they have $\Omega(2^k)$ units.

Note: the number of parameters of a deep network is typically quadratic in the number of units per layer.

This also **holds for ReLU convnets with max pooling layers**.

.citation[M. Telgarsky, Benefits of depth in neural networks, COLT 2016]

---
class: middle

.center[For a fixed parameter budget, deeper is better]

.center[
]

.citation.tiny[Montufar et al.; On the number of linear regions of deep neural networks; 2014]

---
class: middle

# The problem with depth

---
class: middle

Although it was known that _deeper is better_, for decades training deep neural networks remained highly challenging and unstable.

Besides limited hardware and data, there were a few algorithmic flaws that have been fixed/softened in the last decade.

---
# Vanishing gradients

Training deep MLPs with many layers was for a long time (pre-2011) very difficult due to the **vanishing gradient** problem.

- Small gradients slow down, and eventually block, stochastic gradient descent.
- This results in a limited learning capacity.

.center.width-60[
]

.center[Normalized histograms of backpropagated gradients (Glorot and Bengio, 2010).
Gradients for layers far from the output vanish to zero. ] .citation[Glorot and Bengio, Understanding the difficulty of training deep feedforward neural networks; AISTAT 2010] --- class: middle Consider a simplified 3-layer MLP, with $x, w\_1, w\_2, w\_3 \in\mathbb{R}$, such that $$f(x; w\_1, w\_2, w\_3) = \sigma\left(w\_3\sigma\left( w\_2 \sigma\left( w\_1 x \right)\right)\right). $$ Under the hood, this would be evaluated as $$\begin{aligned} u\_1 &= w\_1 x \\\\ u\_2 &= \sigma(u\_1) \\\\ u\_3 &= w\_2 u\_2 \\\\ u\_4 &= \sigma(u\_3) \\\\ u\_5 &= w\_3 u\_4 \\\\ \hat{y} &= \sigma(u\_5) \end{aligned}$$ and its derivative $\frac{\text{d}\hat{y}}{\text{d}w\_1}$ as $$\begin{aligned}\frac{\text{d}\hat{y}}{\text{d}w\_1} &= \frac{\partial \hat{y}}{\partial u\_5} \frac{\partial u\_5}{\partial u\_4} \frac{\partial u\_4}{\partial u\_3} \frac{\partial u\_3}{\partial u\_2}\frac{\partial u\_2}{\partial u\_1}\frac{\partial u\_1}{\partial w\_1}\\\\ &= \frac{\partial \sigma(u\_5)}{\partial u\_5} w\_3 \frac{\partial \sigma(u\_3)}{\partial u\_3} w\_2 \frac{\partial \sigma(u\_1)}{\partial u\_1} x \end{aligned}$$ .credit[Slide credit: G. Louppe] --- class: middle The derivative of the sigmoid activation function $\sigma$ is: .center[
]

$$\frac{\text{d} \sigma}{\text{d} x}(x) = \sigma(x)(1-\sigma(x))$$

Notice that $0 \leq \frac{\text{d} \sigma}{\text{d} x}(x) \leq \frac{1}{4}$ for all $x$.

.credit[Slide credit: G. Louppe]

---
class: middle

Assume that the weights $w\_1, w\_2, w\_3$ are initialized randomly from a Gaussian with zero mean and small variance, such that with high probability $-1 \leq w\_i \leq 1$. Then,

$$\frac{\text{d}\hat{y}}{\text{d}w\_1} = \underbrace{\frac{\partial \sigma(u\_5)}{\partial u\_5}}\_{\leq \frac{1}{4}} \underbrace{w\_3}\_{\leq 1} \underbrace{\frac{\partial \sigma(u\_3)}{\partial u\_3}}\_{\leq \frac{1}{4}} \underbrace{w\_2}\_{\leq 1} \underbrace{\frac{\partial \sigma(u\_1)}{\partial u\_1}}\_{\leq \frac{1}{4}} x$$

This implies that the gradient $\frac{\text{d}\hat{y}}{\text{d}w\_1}$ shrinks **exponentially** to zero as the number of layers in the network increases. Hence the vanishing gradient problem.

- In general, bounded activation functions (sigmoid, tanh, etc.) are prone to the vanishing gradient problem.
- Note the importance of a proper initialization scheme.

.credit[Slide credit: G. Louppe]

---
# Rectified linear units

Instead of the sigmoid activation function, modern neural networks are for the most part based on **rectified linear units** (ReLU) (Glorot et al, 2011):

$$\text{ReLU}(x) = \max(0, x)$$

.center[
] .credit[Slide credit: G. Louppe] --- class: middle Note that the derivative of the ReLU function is $$\frac{\text{d}}{\text{d}x} \text{ReLU}(x) = \begin{cases} 0 &\text{if } x \leq 0 \\\\ 1 &\text{otherwise} \end{cases}$$ .center[
] For $x=0$, the derivative is undefined. In practice, it is set to zero. .credit[Slide credit: G. Louppe] --- class: middle Therefore, $$\frac{\text{d}\hat{y}}{\text{d}w\_1} = \underbrace{\frac{\partial \sigma(u\_5)}{\partial u\_5}}\_{= 1} w\_3 \underbrace{\frac{\partial \sigma(u\_3)}{\partial u\_3}}\_{= 1} w\_2 \underbrace{\frac{\partial \sigma(u\_1)}{\partial u\_1}}\_{= 1} x$$ This **solves** the vanishing gradient problem, even for deep networks! (provided proper initialization) Note that: - The ReLU unit dies when its input is negative, which might block gradient descent. - This is actually a useful property to induce *sparsity*. - This issue can also be solved using **leaky** ReLUs, defined as $$\text{LeakyReLU}(x) = \max(\alpha x, x)$$ for a small $\alpha \in \mathbb{R}^+$ (e.g., $\alpha=0.1$). --- class: middle The steeper slope in the loss surface speeds up the training. .center.width-30[] .citation[A. Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks; NIPS 2012] --- # Many other activation functions - Each neuron/unit is followed by a dedicated activation function - Historically the _sigmoid_ and _tanh_ have been the most popular .center.width-50[] - [Many other activation functions available](https://dashee87.github.io/data%20science/deep%20learning/visualising-activation-functions-in-neural-networks/) --- class: center, middle # Regularization --- # Under-fitting and over-fitting What if we consider a hypothesis space $\mathcal{F}$ in which candidate functions $f$ are either too "simple" or too "complex" with respect to the true data generating process? .center[] --- class: middle .center.width-60[] .credit[Slide credit: F. Fleuret, EE-559 Deep learning] --- class: middle count: false .center.width-60[] .credit[Slide credit: F. Fleuret, EE-559 Deep learning] --- class: middle count: false .center.width-60[] .credit[Slide credit: F. Fleuret, EE-559 Deep learning] --- class: middle count: false .center.width-60[] .credit[Slide credit: F. Fleuret, EE-559 Deep learning] --- class: middle count: false .center.width-60[] .credit[Slide credit: F. Fleuret, EE-559 Deep learning] --- class: middle count: false .center.width-60[] .credit[Slide credit: F. Fleuret, EE-559 Deep learning] --- class: middle count: false .center.width-60[] .credit[Slide credit: F. Fleuret, EE-559 Deep learning] --- class: middle count: false .center.width-60[] .credit[Slide credit: F. Fleuret, EE-559 Deep learning] --- class: middle count: false .center.width-60[] .credit[Slide credit: F. Fleuret, EE-559 Deep learning] --- class: middle count: false .center.width-60[] .credit[Slide credit: F. Fleuret, EE-559 Deep learning] --- class: middle count: false .center.width-60[] .credit[Slide credit: F. Fleuret, EE-559 Deep learning] --- class: middle .center.width-60[] .credit[Slide credit: F. Fleuret, EE-559 Deep learning] --- class: middle Our goal is to adjust the capacity of the hypothesis space such that the expected risk of the empirical risk minimizer gets as low as possible. We define the capacity of a set of predictors as its ability to model an arbitrary functional .center[] --- class: middle Although it is difficult to define precisely, it is quite clear in practice how to increase or decrease it for a given class of models. For example: - The degree of polynomials; - The number of layers in a neural network; - The number of training iterations; - Regularization terms. 
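For the first of these knobs, a minimal sketch of how capacity changes with the degree of a polynomial fit (the data generation, degrees and seed below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 15))
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=x.shape)   # noisy 1D regression data

for degree in [1, 3, 9]:                       # increasing capacity of the hypothesis space
    coeffs = np.polyfit(x, y, degree)          # empirical risk minimizer in this class
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(degree, round(train_mse, 4))         # training error shrinks as capacity grows,
                                               # while the expected risk eventually increases
```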
--- class: middle # Regularization --- .center.width-100[] --- # Regularization We can reformulate the previously used squared error loss $$\ell(y, f(x;\mathbf{w})) = (y - f(x;\mathbf{w}))^2$$ to $$\ell(y, f(x;\mathbf{w})) = (y - f(x;\mathbf{w}))^2 + \rho \sum_{d}^{D} w\_d^2$$
.Q.big.center[What will happen now?] --- # Regularization We can reformulate the previously used squared error loss $$\ell(y, f(x;\mathbf{w})) = (y - f(x;\mathbf{w}))^2$$ to $$\ell(y, f(x;\mathbf{w})) = (y - f(x;\mathbf{w}))^2 + \rho \sum_{d}^{D} w\_d^2$$
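In a deep learning framework this penalty is rarely written by hand; a minimal PyTorch sketch of two equivalent ways to apply it (the model, data and value of $\rho$ are placeholders for illustration):

```py
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                         # placeholder model
x, y = torch.randn(32, 10), torch.randn(32, 1)
rho = 1e-4

# Option 1: add the penalty explicitly to the loss
loss = ((model(x) - y) ** 2).mean() \
       + rho * sum((w ** 2).sum() for w in model.parameters())

# Option 2: same gradient contribution via the optimizer's weight decay
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=2 * rho)
```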
.center.big[This is called __$L\_2$ regularization__.] --- class: middle .center.width-60[] .credit[Slide credit: F. Fleuret, EE-559 Deep learning] --- class: middle count: false .center.width-60[] .credit[Slide credit: F. Fleuret, EE-559 Deep learning] --- class: middle count: false .center.width-60[] .credit[Slide credit: F. Fleuret, EE-559 Deep learning] --- class: middle count: false .center.width-60[] .credit[Slide credit: F. Fleuret, EE-559 Deep learning] --- class: middle count: false .center.width-60[] .credit[Slide credit: F. Fleuret, EE-559 Deep learning] --- class: middle count: false .center.width-60[] .credit[Slide credit: F. Fleuret, EE-559 Deep learning] --- class: middle count: false .center.width-60[] .credit[Slide credit: F. Fleuret, EE-559 Deep learning] --- class: middle count: false .center.width-60[] .credit[Slide credit: F. Fleuret, EE-559 Deep learning] --- class: middle count: false .center.width-60[] .credit[Slide credit: F. Fleuret, EE-559 Deep learning] --- class: middle count: false .center.width-60[] .credit[Slide credit: F. Fleuret, EE-559 Deep learning] --- class: middle count: false .center.width-60[] .credit[Slide credit: F. Fleuret, EE-559 Deep learning] --- class: middle count: false .center.width-60[] .credit[Slide credit: F. Fleuret, EE-559 Deep learning] --- class: middle count: false .center.width-60[] .credit[Slide credit: F. Fleuret, EE-559 Deep learning] --- class: middle count: false .center.width-60[] .credit[Slide credit: F. Fleuret, EE-559 Deep learning] --- class: middle count: false .center.width-60[] .credit[Slide credit: F. Fleuret, EE-559 Deep learning] --- class: middle count: false .center.width-60[] .credit[Slide credit: F. Fleuret, EE-559 Deep learning] --- class: middle count: false .center.width-60[] .credit[Slide credit: F. Fleuret, EE-559 Deep learning] --- class: center, middle # Deep regularization --- class: middle .width-40[] .citation[C. Zhang et al., Understanding deep learning requires rethinking generalization, ICLR 2017] --- class: middle However ... .center.width-100[] .citation[C. Zhang et al., Understanding deep learning requires rethinking generalization, ICLR 2017] --- class: middle However ... .center.width-40[] .citation[C. Zhang et al., Understanding deep learning requires rethinking generalization, ICLR 2017] --- class: center, middle ### Most of the weights of the network are grouped in the final layers. .center.width-80[] --- # VGG-16 .center[
] .citation[K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, NIPS 2014] --- # Memory and Parameters ```md Activation maps Parameters INPUT: [224x224x3] = 150K 0 CONV3-64: [224x224x64] = 3.2M (3x3x3)x64 = 1,728 CONV3-64: [224x224x64] = 3.2M (3x3x64)x64 = 36,864 POOL2: [112x112x64] = 800K 0 CONV3-128: [112x112x128] = 1.6M (3x3x64)x128 = 73,728 CONV3-128: [112x112x128] = 1.6M (3x3x128)x128 = 147,456 POOL2: [56x56x128] = 400K 0 CONV3-256: [56x56x256] = 800K (3x3x128)x256 = 294,912 CONV3-256: [56x56x256] = 800K (3x3x256)x256 = 589,824 CONV3-256: [56x56x256] = 800K (3x3x256)x256 = 589,824 POOL2: [28x28x256] = 200K 0 CONV3-512: [28x28x512] = 400K (3x3x256)x512 = 1,179,648 CONV3-512: [28x28x512] = 400K (3x3x512)x512 = 2,359,296 CONV3-512: [28x28x512] = 400K (3x3x512)x512 = 2,359,296 POOL2: [14x14x512] = 100K 0 CONV3-512: [14x14x512] = 100K (3x3x512)x512 = 2,359,296 CONV3-512: [14x14x512] = 100K (3x3x512)x512 = 2,359,296 CONV3-512: [14x14x512] = 100K (3x3x512)x512 = 2,359,296 POOL2: [7x7x512] = 25K 0 FC: [1x1x4096] = 4096 7x7x512x4096 = 102,760,448 FC: [1x1x4096] = 4096 4096x4096 = 16,777,216 FC: [1x1x1000] = 1000 4096x1000 = 4,096,000 TOTAL activations: 24M x 4 bytes ~= 93MB / image (x2 for backward) TOTAL parameters: 138M x 4 bytes ~= 552MB (x2 for plain SGD, x4 for Adam) ``` .credit[Slide credit: C. Ollion & O. Grisel, M2DS Deep Learning] --- # Memory and Parameters ```md Activation maps Parameters INPUT: [224x224x3] = 150K 0 *CONV3-64: [224x224x64] = 3.2M (3x3x3)x64 = 1,728 *CONV3-64: [224x224x64] = 3.2M (3x3x64)x64 = 36,864 POOL2: [112x112x64] = 800K 0 CONV3-128: [112x112x128] = 1.6M (3x3x64)x128 = 73,728 CONV3-128: [112x112x128] = 1.6M (3x3x128)x128 = 147,456 POOL2: [56x56x128] = 400K 0 CONV3-256: [56x56x256] = 800K (3x3x128)x256 = 294,912 CONV3-256: [56x56x256] = 800K (3x3x256)x256 = 589,824 CONV3-256: [56x56x256] = 800K (3x3x256)x256 = 589,824 POOL2: [28x28x256] = 200K 0 CONV3-512: [28x28x512] = 400K (3x3x256)x512 = 1,179,648 CONV3-512: [28x28x512] = 400K (3x3x512)x512 = 2,359,296 CONV3-512: [28x28x512] = 400K (3x3x512)x512 = 2,359,296 POOL2: [14x14x512] = 100K 0 CONV3-512: [14x14x512] = 100K (3x3x512)x512 = 2,359,296 CONV3-512: [14x14x512] = 100K (3x3x512)x512 = 2,359,296 CONV3-512: [14x14x512] = 100K (3x3x512)x512 = 2,359,296 POOL2: [7x7x512] = 25K 0 *FC: [1x1x4096] = 4096 7x7x512x4096 = 102,760,448 FC: [1x1x4096] = 4096 4096x4096 = 16,777,216 FC: [1x1x1000] = 1000 4096x1000 = 4,096,000 TOTAL activations: 24M x 4 bytes ~= 93MB / image (x2 for backward) TOTAL parameters: 138M x 4 bytes ~= 552MB (x2 for plain SGD, x4 for Adam) ``` .credit[Slide credit: C. Ollion & O. Grisel, M2DS Deep Learning] --- # Dropout - First "deep" regularization technique - Remove units at random during the forward pass on each sample - Put them all back during test .center[
]

.citation[Srivastava et al., Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR 2014]

---
# Dropout

## Interpretation

- Reduces the network's dependency on individual neurons and distributes the representation
- More redundant representation of the data

## Ensemble interpretation

- Equivalent to training a large ensemble of shared-parameter, binary-masked models
- Each of these models is only trained on a single data point
- __A network with dropout can be interpreted as an ensemble of $2^N$ models with heavy weight sharing__ (Goodfellow et al., 2013)

---
# Dropout

.center[
] - One has to decide on which units/layers to use dropout, and with what probability $p$ units are dropped. - During training, for each sample, as many Bernoulli variables as units are sampled independently to select units to remove. - To keep the means of the inputs to layers unchanged, the initial version of dropout was multiplying activations by $p$ during test. - The standard variant is the "inverted dropout": multiply activations by $\frac{1}{1-p}$ during training and keep the network untouched during test. --- # Dropout Overfitting noise .center[
] .credit[Slide credit: C. Ollion & O. Grisel, M2DS Deep Learning] --- # Dropout A bit of Dropout .center[
] .credit[Slide credit: C. Ollion & O. Grisel, M2DS Deep Learning] --- # Dropout Too much: underfitting .center[
] .credit[Slide credit: C. Ollion & O. Grisel, M2DS Deep Learning] --- # Dropout .center.width-50[] .citation[Srivastava et al., Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR 2014] --- # Dropout Features learned on MNIST with one hidded layer autoencoders having 256 rectified linear units .center.width-80[] .citation[Srivastava et al., Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR 2014] --- # Dropout Effect of _Dropout_ on sparsity .center.width-80[] - ReLUs used for both models - showing histogram of mean activations and histogram of activations .citation[Srivastava et al., Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR 2014] --- # Dropout ```py >>> x = torch.full((3, 5), 1.0).requires_grad_() >>> x tensor([[ 1., 1., 1., 1., 1.], [ 1., 1., 1., 1., 1.], [ 1., 1., 1., 1., 1.]]) >>> dropout = nn.Dropout(p = 0.75) >>> y = dropout(x) >>> y tensor([[ 0., 0., 4., 0., 4.], [ 0., 4., 4., 4., 0.], [ 0., 0., 4., 0., 0.]]) >>> l = y.norm(2, 1).sum() >>> l.backward() >>> x.grad tensor([[ 0.0000, 0.0000, 2.8284, 0.0000, 2.8284] [ 0.0000, 2.3094, 2.3094, 2.3094, 0.0000] [ 0.0000, 0.0000, 4.0000, 0.0000, 0.0000]]) ``` --- # Dropout ```py >>> x = torch.full((3, 5), 1.0).requires_grad_() >>> x tensor([[ 1., 1., 1., 1., 1.], [ 1., 1., 1., 1., 1.], [ 1., 1., 1., 1., 1.]]) >>> dropout = nn.Dropout(p = 0.75) >>> y = dropout(x) *>>> y *tensor([[ 0., 0., 4., 0., 4.], * [ 0., 4., 4., 4., 0.], * [ 0., 0., 4., 0., 0.]]) >>> l = y.norm(2, 1).sum() >>> l.backward() >>> x.grad tensor([[ 0.0000, 0.0000, 2.8284, 0.0000, 2.8284] [ 0.0000, 2.3094, 2.3094, 2.3094, 0.0000] [ 0.0000, 0.0000, 4.0000, 0.0000, 0.0000]]) ``` --- # Dropout For a given network ```py model = nn.Sequential(nn.Linear(10, 100), nn.ReLU(), nn.Linear(100, 50), nn.ReLU(), nn.Linear(50, 2)); ``` -- we can simply add dropout layers ```py model = nn.Sequential(nn.Linear(10, 100), nn.ReLU(), * nn.Dropout(), nn.Linear(100, 50), nn.ReLU(), * nn.Dropout(), nn.Linear(50, 2)); ``` --- # Dropout A model using dropout has to be set in __train__ or __test__ mode --- # Dropout A model using dropout has to be set in __train__ or __test__ mode The method `nn.Module.train(mode)` recursively sets the flag `training` to all sub-modules. ```py >>> dropout = nn.Dropout() >>> model = nn.Sequential(nn.Linear(3, 10), dropout, nn.Linear(10, 3)) >>> dropout.training True >>> model.train(False) Sequential ( (0): Linear (3 -> 10) (1): Dropout (p = 0.5) (2): Linear (10 -> 3) ) >>> dropout.training False ``` --- # Dropout A model using dropout has to be set in __train__ or __test__ mode ```py >>> dropout = nn.Dropout() >>> model = nn.Sequential(nn.Linear(3, 10), dropout, nn.Linear(10, 3)) >>> x = torch.full((1, 3), 1.0) *>>> model.train() Sequential ( (0): Linear (3 -> 10) (1): Dropout (p = 0.5) (2): Linear (10 -> 3) ) >>> model(x) *tensor([[ 0.5360, -0.5225, -0.5129]], grad_fn=
) >>> model(x) *tensor([[ 0.6134, -0.6130, -0.5161]], grad_fn=
) ``` --- # Dropout A model using dropout has to be set in __train__ or __test__ mode ```py >>> dropout = nn.Dropout() >>> model = nn.Sequential(nn.Linear(3, 10), dropout, nn.Linear(10, 3)) >>> x = torch.full((1, 3), 1.0) >>> model.train() Sequential ( (0): Linear (3 -> 10) (1): Dropout (p = 0.5) (2): Linear (10 -> 3) ) >>> model(x) tensor([[ 0.5360, -0.5225, -0.5129]], grad_fn=
) >>> model(x) tensor([[ 0.6134, -0.6130, -0.5161]], grad_fn=
) >>> *>>> model.eval() Sequential ( (0): Linear (3 -> 10) (1): Dropout (p = 0.5) (2): Linear (10 -> 3) ) >>> model(x) *tensor([[ 0.5772, -0.0944, -0.1168]], grad_fn=
) >>> model(x) *tensor([[ 0.5772, -0.0944, -0.1168]], grad_fn=
) ``` --- # Spatial Dropout As pointed out by Tompson et al. (2015), units in a 2d activation map are generally locally correlated, and dropout has virtually no effect. They proposed SpatialDropout, which drops channels instead of individual units. .credit[Slide credit: F. Fleuret, EE-559 Deep learning] --- # Spatial Dropout ```py >>> dropout2d = nn.Dropout2d() >>> x = Variable(Tensor(2, 3, 2, 2).fill_(1.0)) >>> dropout2d(x) Variable containing: (0 ,0 ,.,.) = 0 0 0 0 (0 ,1 ,.,.) = 0 0 0 0 (0 ,2 ,.,.) = 2 2 2 2 (1 ,0 ,.,.) = 2 2 2 2 (1 ,1 ,.,.) = 0 0 0 0 (1 ,2 ,.,.) = 2 2 2 2 [torch.FloatTensor of size 2x3x2x2] ``` --- # DropOut for uncertainty estimation .center.width-100[] .citation[A. Kendall, What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?, NIPS 2017] --- # Batch normalization We saw that maintaining proper statistics of the activations and derivatives was a critical issue to allow the training of deep architectures. It is the main motivation behind weight initialization rules (we'll cover them later). --- # Batch normalization We saw that maintaining proper statistics of the activations and derivatives was a critical issue to allow the training of deep architectures. It is the main motivation behind weight initialization rules (we'll cover them later). A different approach consists of explicitly forcing the activation statistics during the forward pass by re-normalizing them. __Batch normalization__ proposed by Ioffe and Szegedy (2015) was the first method introducing this idea. --- class:middle "Training Deep Neural Networks is complicated by the fact that the __distribution of each layer's inputs changes during training, as the parameters of the previous layers change__. This slows down the training by requiring lower learning rates and careful parameter initialization ..." .pull-right[(Ioffe and Szegedy, 2015)] .citation[S. Ioffe and C. Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift; arXiv 2015] --- class:middle "Training Deep Neural Networks is complicated by the fact that the __distribution of each layer's inputs changes during training, as the parameters of the previous layers change__. This slows down the training by requiring lower learning rates and careful parameter initialization ..." .pull-right[(Ioffe and Szegedy, 2015)] .reset-column[ ] .center.width-60[] .citation[S. Ioffe and C. Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift; arXiv 2015] --- class: middle Consider a simplified 3-layer MLP, with $x, w\_1, w\_2, w\_3 \in\mathbb{R}$, such that $$f(x; w\_1, w\_2, w\_3) = \sigma\left(w\_3\sigma\left( w\_2 \sigma\left( w\_1 x \right)\right)\right). $$ Under the hood, this would be evaluated as $$\begin{aligned} u\_1 &= w\_1 x \\\\ u\_2 &= \sigma(u\_1) \\\\ u\_3 &= w\_2 u\_2 \\\\ u\_4 &= \sigma(u\_3) \\\\ u\_5 &= w\_3 u\_4 \\\\ \hat{y} &= \sigma(u\_5) \end{aligned}$$ --- class:middle "Training Deep Neural Networks is complicated by the fact that the __distribution of each layer's inputs changes during training, as the parameters of the previous layers change__. This slows down the training by requiring lower learning rates and careful parameter initialization ..." .pull-right[(Ioffe and Szegedy, 2015)] .reset-column[ ] .center.width-60[] Batch normalization can be done anywhere in a deep architecture, and forces the activations' first and second order moments, so that the following layers do not need to adapt to their drift. .citation[S. 
Ioffe and C. Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift; arXiv 2015]

---
# Batch normalization

Normalize activations in each **mini-batch** before the activation function: this **speeds up** and **stabilizes** training (less dependent on initialization).

Batch normalization forces the activations' first and second order moments, so that the following layers do not need to adapt to their drift.

---
# Batch normalization

Normalize activations in each **mini-batch** before the activation function: this **speeds up** and **stabilizes** training (less dependent on initialization).

.center[
]
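A minimal sketch of the training-time computation on a mini-batch (variable names and `eps` are illustrative; the real `nn.BatchNorm1d` additionally tracks running statistics for test time):

```py
import torch

def batch_norm_train(x, gamma, beta, eps=1e-5):
    # x: (batch, features); statistics are computed over the batch dimension
    mu = x.mean(dim=0)
    var = x.var(dim=0, unbiased=False)
    x_hat = (x - mu) / torch.sqrt(var + eps)   # zero mean, unit variance per feature
    return gamma * x_hat + beta                # learned scale and shift

x = torch.randn(64, 100)
gamma, beta = torch.ones(100), torch.zeros(100)
y = batch_norm_train(x, gamma, beta)
```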
.citation[S. Ioffe and C. Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift; arXiv 2015] --- # Batch normalization During training batch normalization __shifts and rescales according to the mean and variance estimated on the batch__. .center[
]

As for dropout, the model behaves differently during training and test.

---
# Batch normalization

At **inference time**, use the mean and standard deviation computed on **the whole training set** (in practice, running averages accumulated during training) instead of the batch statistics.

Widely used in **ConvNets**, but requires the mini-batch to be large enough for the statistics to be meaningful.

---
# Batch normalization

As with dropout, batch normalization is implemented as a separate module, `nn.BatchNorm1d`, that processes the input components separately.

```py
>>> x = torch.Tensor(10000, 3).normal_()
>>> x = x * torch.Tensor([2., 5., 10.]) + torch.Tensor([-10., 25., 3.])
>>> x.mean(0)
tensor([-9.9898, 24.9165,  2.8945])
>>> x.std(0)
tensor([2.0006, 5.0146, 9.9501])
>>> bn = nn.BatchNorm1d(3)
>>> with torch.no_grad():
...     bn.bias.copy_(torch.tensor([2., 4., 8.]))
...     bn.weight.copy_(torch.tensor([1., 2., 3.]))
...
Parameter containing:
tensor([2., 4., 8.])
Parameter containing:
tensor([1., 2., 3.])
>>> y = bn(x)
>>> y.mean(0)
tensor([2.0000, 4.0000, 8.0000])
>>> y.std(0)
tensor([1.0005, 2.0010, 3.0015])
```

---
# Batch normalization

`BatchNorm2d` example

```py
>>> x = torch.randn(20, 100, 35, 45)
>>> bn2d = nn.BatchNorm2d(100)
>>> y = bn2d(x)
>>> x.size()
torch.Size([20, 100, 35, 45])
>>> bn2d.weight.data.size()
torch.Size([100])
>>> bn2d.bias.data.size()
torch.Size([100])
```

---
# Batch normalization

Results on ImageNet LSVRC 2012:

.center[
] .citation[S. Ioffe and C. Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift; arXiv 2015] --- # Batch normalization Results on ImageNet LSVRC 2012: .center[
] - learning rate can be greater - dropout and local normalization are not necessary - $L^2$ regularization influence should be reduced .citation[S. Ioffe and C. Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift; arXiv 2015] --- # Batch normalization Deep MLP on a 2d "disc" toy example (class $+1$ on the disk, $-1$ outside), with naive Gaussian weight initialization, cross-entropy, standard SGD, $\eta = 0.1$. ```py def create_model(with_batchnorm, nc = 32, depth = 16): modules = [] modules.append(nn.Linear(2, nc)) if with_batchnorm: modules.append(nn.BatchNorm1d(nc)) modules.append(nn.ReLU()) for d in range(depth): modules.append(nn.Linear(nc, nc)) if with_batchnorm: modules.append(nn.BatchNorm1d(nc)) modules.append(nn.ReLU()) modules.append(nn.Linear(nc, 2)) return nn.Sequential(*modules) ``` We try different standard deviations for the weights: ```py with torch.no_grad(): for p in model.parameters(): p.normal_(0, std) ``` .credit[Slide credit: F. Fleuret, EE-559 Deep learning] --- # Batch normalization .center[
]

.credit[Slide credit: F. Fleuret, EE-559 Deep learning]

---
The position of batch normalization relative to the non-linearity is not clear.

.center.width-40[ ]

--

In the original paper, Ioffe and Szegedy added BN right before the non-linearity:

$$\dots \to \texttt{LINEAR} \to \texttt{BN} \to \texttt{ReLU} \to \dots$$

--

However, there are arguments for both orderings: activations after the non-linearity are less "naturally normalized" and benefit more from batch normalization. Experiments are generally in favor of this second option, which is the current default:

$$\dots \to \texttt{LINEAR} \to \texttt{ReLU} \to \texttt{BN} \to \dots$$

---
# Layer Normalization

Normalize using the statistics of the **layer activations** (per sample) instead of the mini-batch.

.center[
]

.citation[Ba et al., Layer Normalization, 2016]

--

The algorithm is then similar to Batch Normalization.

--

Suited for **RNNs**; degrades the performance of **CNNs**.

---
# Multiple variants
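The variants compared here differ mainly in which axes the mean and variance are computed over; a minimal sketch for an $(N, C, H, W)$ activation tensor (the tensor shape, group count $G$ and the omitted learnable scale/shift are illustrative simplifications):

```py
import torch

x = torch.randn(8, 32, 14, 14)    # (N, C, H, W) activations
eps, G = 1e-5, 4                  # G groups for Group Normalization (assumed)

def normalize(t, dims):
    mu = t.mean(dim=dims, keepdim=True)
    var = t.var(dim=dims, keepdim=True, unbiased=False)
    return (t - mu) / torch.sqrt(var + eps)

bn = normalize(x, (0, 2, 3))      # BatchNorm: over batch and spatial dims, per channel
ln = normalize(x, (1, 2, 3))      # LayerNorm: over all features, per sample
inorm = normalize(x, (2, 3))      # InstanceNorm: over spatial dims, per sample and channel
gn = normalize(x.view(8, G, -1), (2,)).view_as(x)   # GroupNorm: per sample and channel group
```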
.center[
] .citation[Y. Wu and K. He, Group Normalization, ECCV 2018] --- # Group Normalization .center.width-70[] .citation[Y. Wu and K. He, Group Normalization, ECCV 2018] --- # Group Normalization .center.width-80[] .citation[Y. Wu and K. He, Group Normalization, ECCV 2018] --- # Group Normalization
.center.width-70[] .caption[Evolution of `conv5_3` activations during training] .citation[Y. Wu and K. He, Group Normalization, ECCV 2018] --- # Group Normalization .center.width-70[] .center.width-70[] .citation[Y. Wu and K. He, Group Normalization, ECCV 2018] --- # Weight Normalization Reparametrize weights of neurons, to decouple **direction** and **norm** of the weight: $$ \mathbf{w} = \frac{g}{||\mathbf{v}||}\mathbf{v} $$ .citation[T. Salimans et al., Weight normalization: A simple reparameterization to accelerate training of deep neural networks, NIPS 2016. ] -- One new parameter $g$ to learn per neuron: $g = {||\mathbf{w}||}$ -- For CNNs it's more convenient to normalize weights rather than activations: num_params $<$ num_activations -- Careful **data-based** initialization of $g$ and neuron bias $b$ is better (not applicable to RNNs) --- class: middle # How deep can we go now? --- # A saturation point If we continue stacking more layers on a CNN: .center[
] -- .center[.red[Deeper models are harder to optimize]] --- .left-column[ # ResNet ] .right-column[ .center[
] ] A block learns the residual w.r.t. identity .center[
] .citation[K. He et al., Deep residual learning for image recognition, CVPR 2016.] -- - Good optimization properties --- .left-column[ # ResNet ] .right-column[ .center[
] ] Even deeper models: 34, 50, 101, 152 layers .citation[K. He et al., Deep residual learning for image recognition, CVPR 2016.] --- .left-column[ # ResNet ] .right-column[ .center[
] ] ResNet50 Compared to VGG: ### Superior accuracy in all vision tasks
**5.25%** top-5 error vs 7.1%

--

### Fewer parameters
**25M** vs 138M -- ### Computational complexity
**3.8B FLOPs** vs 15.3B FLOPs

--

### Fully Convolutional until the last layer

.citation[K. He et al., Deep residual learning for image recognition, CVPR 2016.]

---
# ResNet

## Performance on ImageNet

.center.width-90[]

---
# ResNet

The output of a residual network can be understood as an ensemble, which in part explains its stability.

.center[
] .citation[A. Veit et al., Residual Networks Behave Like Ensembles of Relatively Shallow Networks, NIPS 2016] --- # ResNet ## Results .center.width-100[] --- # ResNet Results .center[
]

---
# ResNet

In PyTorch:

```py
def make_resnet_block(num_feature_maps, kernel_size=3):
    return nn.Sequential(
        nn.Conv2d(num_feature_maps, num_feature_maps,
                  kernel_size=kernel_size,
                  padding=(kernel_size - 1) // 2),
        nn.BatchNorm2d(num_feature_maps),
        nn.ReLU(inplace=True),
        nn.Conv2d(num_feature_maps, num_feature_maps,
                  kernel_size=kernel_size,
                  padding=(kernel_size - 1) // 2),
        nn.BatchNorm2d(num_feature_maps),
    )
```

---
# ResNet

In PyTorch:

```py
def __init__(self, num_residual_blocks, num_feature_maps):
    ...
    self.resnet_blocks = nn.ModuleList()
    for k in range(num_residual_blocks):
        self.resnet_blocks.append(make_resnet_block(num_feature_maps, 3))
    ...
```

```py
def forward(self, x):
    ...
    for b in self.resnet_blocks:
*       x = x + b(x)
    ...
    return x
```

---
For ResNet-50 and deeper, some additional modifications need to be made to keep the number of parameters and computations manageable.

.center.width-70[]

Such a block requires $2 \times (3 \times 3 \times 256 +1) \times 256 \simeq 1.2M$ parameters

--

Address this problem using a __bottleneck__ block

.center.width-100[]

.center[$256 \times 64 + (3 \times 3 \times 64 +1) \times 64 + 64 \times 256 \simeq 70K$ parameters]

---
# Deeper is better

.center.width-70[]

---
# Inception-V4 / -ResNet-V2

Deep, modular and state-of-the-art

Achieves **3.1% top-5** classification error on ImageNet

.center.width-50[]

.citation[C. Szegedy et al., Inception-v4, Inception-ResNet and the impact of residual connections on learning, 2016]

---
# Inception-V4 / -ResNet-V2

More building blocks engineering...

.center.width-80[]

.citation[C. Szegedy et al., Inception-v4, Inception-ResNet and the impact of residual connections on learning, 2016]

---
# ResNet variants: Stochastic Depth Networks

- Dropout at the layer level
- Allows training up to 1K layers

.center.width-80[]

.citation.tiny[Huang et al., Deep Networks with Stochastic Depth, ECCV 2016]

---
# ResNet variants: DenseNet

- Copying feature maps to upper layers via skip-connections
- Better reuse of parameters and redundancy avoidance

.center[
] .center[
] .citation[Huang et al., Densely Connected Convolutional Networks, CVPR 2017] --- # Visualizing loss surfaces .center.width-70[] .citation[H. Li et al., Visualizing the Loss Landscape of Neural Nets, ICLR workshop 2018] --- # Visualizing loss surfaces Pick a center point $\theta^{*}$ in the graph and choose two direction vectors $\delta$ and $\eta$ .citation[H. Li et al., Visualizing the Loss Landscape of Neural Nets, ICLR workshop 2018] -- Plot a function of the form: - for 1D: $$f(\alpha)= \mathcal{L}(\theta^{*} + \alpha \delta)$$ - for 2D: $$f(\alpha, \beta)= \mathcal{L}(\theta^{*} + \alpha \delta + \beta \eta)$$ -- _A few more normalization and visualization hacks are required_ --- # Visualizing loss surfaces .left-column[ .center.width-100[] ] .right-column[
- ResNet-20/56/110 : vanilla - ResNet-*-noshort: no skip connections - ResNet-18/34/50 : wide ] .citation[H. Li et al., Visualizing the Loss Landscape of Neural Nets, ICLR workshop 2018] --- # Visualizing loss surfaces .center.width-50[] .citation[H. Li et al., Visualizing the Loss Landscape of Neural Nets, ICLR workshop 2018] --- # Outline - Universal approximation theorem - Why going deeper? - Regularization in deep networks + classic regularization: $L\_2$ regularization + implicit regularization: Dropout, Batch Normalization - Residual networks --- class: end-slide, center count: false The end.