class: center, middle, title-slide
count: false

# Embeddings

.bold[Marc Lelarge]

---
# Tip of the week: Dataloading

--
count: false

## Dataset class

`torch.utils.data.Dataset` is an abstract class representing a dataset. Your custom dataset should inherit from `Dataset` and override the following methods (a minimal sketch of such a class is given in Examples (2) below):

- `__len__` so that `len(dataset)` returns the size of the dataset.
- `__getitem__` to support indexing, so that `dataset[i]` returns the i-th sample.

--
count: false

## Iterating through the dataset with `DataLoader`

By using a simple `for` loop to iterate over the data, we are missing out on:

- Batching the data
- Shuffling the data
- Loading the data in parallel using multiprocessing workers.

`torch.utils.data.DataLoader` is an iterator which provides all these features.

---
## Examples (1)

In the first lesson, we created two datasets, one for training and one for validation:

```
from torchvision import transforms, datasets

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

vgg_format = transforms.Compose([transforms.CenterCrop(224),
                                 transforms.ToTensor(),
                                 normalize])

dsets = {x: datasets.ImageFolder(os.path.join(data_dir, x), vgg_format)
         for x in ['train', 'valid']}
```

Hence `len(dsets['train'])` returns `23000`, i.e. the number of images in the training set, and more precisely, the number of files located under `data_dir/train/`.

Recall that `data_dir/train/` (and similarly `data_dir/valid/`) is split into two folders, `cats/` and `dogs/`; you can check that each of these folders contains `11500` images with `ls | wc -l`.

You can recover the classes with `dsets['train'].classes`, which returns `['cats', 'dogs']`.

These are features of the `torchvision.datasets.ImageFolder` class.

--
count: false

More importantly, what does `dsets['train'][0]` return?

--
count: false

Answer: a tuple containing a tensor and a label.

---
## Examples (1)

To obtain a dataloader for the training set:

```
train_loader = torch.utils.data.DataLoader(dsets['train'], batch_size=64,
                                           shuffle=True, num_workers=6)
```

--
count: false

Then, you can use it as follows:

```
for input, label in train_loader:
    output = model(input)
    loss = loss_fn(output, label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ...
```

---
## Examples (2)

In the first lesson, we precomputed features and converted them to `numpy` arrays in order to store them. You first need to load the features and the labels:

```
feat_train = load_array(os.path.join(data_dir_colab, 'vgg16', 'feat_train.bc'))
labels_train = load_array(os.path.join(data_dir_colab, 'vgg16', 'lbs_train.bc'))
```

and then, you can create a list for your dataset as follows:

```
train_dataset = [[torch.from_numpy(f).type(torch.float),
                  torch.tensor(l).type(torch.double)]
                 for (f, l) in zip(feat_train, labels_train)]
```

A list supports `len()` and `__getitem__()`, hence this is a valid dataset for PyTorch.
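--
count: false

Equivalently, you can wrap the same arrays in a small custom `Dataset` subclass. Here is a minimal sketch, not the code used in the lesson: the class name `FeaturesDataset` is only illustrative, and it assumes `feat_train` and `labels_train` are the `numpy` arrays loaded above:

```
import torch
from torch.utils.data import Dataset

class FeaturesDataset(Dataset):
    # illustrative helper: wraps precomputed features and labels
    def __init__(self, features, labels):
        # convert the numpy arrays to tensors once, at construction time
        self.features = torch.from_numpy(features).float()
        self.labels = torch.from_numpy(labels).double()

    def __len__(self):
        # size of the dataset
        return len(self.features)

    def __getitem__(self, idx):
        # i-th sample: a (feature, label) pair
        return self.features[idx], self.labels[idx]

train_dataset = FeaturesDataset(feat_train, labels_train)
```

Both the list and the class expose `__len__` and `__getitem__`, which is all a `DataLoader` needs.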
--
count: false

To create a dataloader, it is as simple as:

```
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=128,
                                           shuffle=True)
```

---
## Examples (3)

Today, we will make our own dataloader from scratch using a Python generator:

```
def minibatch(batch_size, *tensors):
    if len(tensors) == 1:
        tensor = tensors[0]
        for i in range(0, len(tensor), batch_size):
            yield tensor[i:i + batch_size]
    else:
        for i in range(0, len(tensors[0]), batch_size):
            yield tuple(x[i:i + batch_size] for x in tensors)
```

--
count: false

We will use it as follows:

```
for (minibatch_num,
     (batch_user, batch_item, batch_rating)) in enumerate(
         minibatch(batch_size, user_ids_tensor, item_ids_tensor, ratings_tensor)):
    ...
```

where the user and item ids are obtained from `numpy` integers and the ratings from `numpy` floats as follows:

```
user_ids_tensor = torch.from_numpy(users)
item_ids_tensor = torch.from_numpy(items)
ratings_tensor = torch.from_numpy(ratings)
```

---
## To know more about dataloading

have a look at the related [PyTorch tutorial](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html)

--
count: false

## Now: how to deal with symbolic data like ids?

---
# One-hot encoding

Encoding colors:

.center.width-40[![](images/one-hot.jpg)]

Why not simply use: blue = 0, red = 1 and so on?

What are the pros and cons of such a representation?

--
count: false

- Each axis has a meaning.
- Sparse, discrete representation with large dimension (the size of the vocabulary).
- Symbols are all equidistant from each other, with Euclidean distance $\sqrt{2}$, however large the vocabulary.

---
# Embeddings

Idea: project into $\mathbb{R}^d$ with $d$ much smaller than the size $n$ of the vocabulary!

--
count: false

More formally, let $W\in \mathbb{R}^{n\times d}$; we define
$$
\text{embedding}(x) = \text{onehot}(x) W
$$

$W$ is typically randomly initialized, then tuned by backprop: its entries are trainable parameters of the model.

--
count: false

- Continuous and dense representation.
- Can represent a huge vocabulary in a low dimension.
- Axes have no meaning a priori, but once trained, the embedding metric can capture semantic distance.

---
# Example: embeddings for movies in recommender systems

.center.width-50[![](images/Pca_viz.png)]

PCA of embeddings for movies learned in this [practical](https://github.com/mlelarge/dataflowr/blob/master/CEA_EDF_INRIA/03_collaborative_filtering_empty_colab.ipynb)

.citation.tiny[source Fast.ai]

---
## Embeddings in PyTorch

[Sparse layers](https://pytorch.org/docs/master/nn.html#embedding) in PyTorch: `torch.nn.Embedding(num_embeddings, embedding_dim)`

Example: creating embeddings for users

```
embedding_dim = 3
embedding_user = nn.Embedding(total_user_id, embedding_dim)
input = torch.LongTensor([[1, 2, 4, 5], [4, 3, 2, 0]])
embedding_user(input)
```

--
count: false

returns

```
tensor([[[-1.7219,  0.4682,  0.5729],
         [ 0.0657, -0.7046, -0.4397],
         [ 1.1512,  1.3380, -1.0272],
         [ 1.4405,  1.2945,  1.0051]],

        [[ 1.1512,  1.3380, -1.0272],
         [-0.6698, -2.1828, -0.1695],
         [ 0.0657, -0.7046, -0.4397],
         [-0.7966, -1.6769, -0.7454]]])
```

--
count: false

So now it is time to play with embeddings!

---
class: end-slide, center
count: false

The end.