class: center, middle, title-slide
count: false

# Embeddings

.bold[Marc Lelarge]

---
# Tip of the week: Dataloading

--
count: false

## Dataset class

`torch.utils.data.Dataset` is an abstract class representing a dataset. Your custom dataset should inherit from `Dataset` and override the following methods (a minimal sketch of such a class is given in Examples (2) below):

- `__len__` so that `len(dataset)` returns the size of the dataset.
- `__getitem__` to support indexing, so that `dataset[i]` returns the i-th sample.

--
count: false

## Iterating through the dataset with `DataLoader`

By using a simple `for` loop to iterate over the data, we are missing out on:

- Batching the data
- Shuffling the data
- Loading the data in parallel using multiprocessing workers.

`torch.utils.data.DataLoader` is an iterator which provides all these features.

---
## Examples (1)

In the first lesson, we created two datasets, one for training and one for validation:

```
from torchvision import transforms, datasets

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

vgg_format = transforms.Compose([transforms.CenterCrop(224),
                                 transforms.ToTensor(),
                                 normalize])

dsets = {x: datasets.ImageFolder(os.path.join(data_dir, x), vgg_format)
         for x in ['train', 'valid']}
```

Hence `len(dsets['train'])` returns `23000`, i.e. the number of images in the training set, and more precisely, the number of files located under `data_dir/train/`.

Recall that `data_dir/train/` (and similarly `data_dir/valid/`) is split into two folders, `cats/` and `dogs/`; you can check that each of these folders contains `11500` images with `ls | wc -l`.

You can recover the classes with `dsets['train'].classes`, which returns `['cats', 'dogs']`.

These are features of the `torchvision.datasets.ImageFolder` class.

--
count: false

More importantly, what does `dsets['train'][0]` return?

--
count: false

Answer: a tuple containing a tensor and a label.

---
## Examples (1)

To obtain a dataloader for the training set:

```
train_loader = torch.utils.data.DataLoader(dsets['train'], batch_size=64,
                                           shuffle=True, num_workers=6)
```

--
count: false

Then, you can use it as follows:

```
for input, label in train_loader:
    output = model(input)
    loss = loss_fn(output, label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ...
```

---
## Examples (2)

In the first lesson, we precomputed features and converted them to `numpy` arrays in order to store them. You first need to load the features and the labels:

```
feat_train = load_array(os.path.join(data_dir_colab, 'vgg16', 'feat_train.bc'))
labels_train = load_array(os.path.join(data_dir_colab, 'vgg16', 'lbs_train.bc'))
```

and then, you can create a list for your dataset as follows:

```
train_dataset = [[torch.from_numpy(f).type(torch.float),
                  torch.tensor(l).type(torch.double)]
                 for (f, l) in zip(feat_train, labels_train)]
```

A list supports `len()` and `__getitem__()`, hence this is a valid dataset for PyTorch.
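--
count: false

Equivalently, you can wrap the same arrays in a small custom `Dataset` subclass. Here is a minimal sketch, not the code used in the lesson: the class name `FeaturesDataset` is only illustrative, and it assumes `feat_train` and `labels_train` are the `numpy` arrays loaded above:

```
import torch
from torch.utils.data import Dataset

class FeaturesDataset(Dataset):
    # illustrative helper: wraps precomputed features and labels
    def __init__(self, features, labels):
        # convert the numpy arrays to tensors once, at construction time
        self.features = torch.from_numpy(features).float()
        self.labels = torch.from_numpy(labels).double()

    def __len__(self):
        # size of the dataset
        return len(self.features)

    def __getitem__(self, idx):
        # i-th sample: a (feature, label) pair
        return self.features[idx], self.labels[idx]

train_dataset = FeaturesDataset(feat_train, labels_train)
```

Both the list and the class expose `__len__` and `__getitem__`, which is all a `DataLoader` needs.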
--
count: false

To create a dataloader, it is as simple as:

```
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=128,
                                           shuffle=True)
```

---
## Examples (3)

Today, we will make our own dataloader from scratch using a Python generator:

```
def minibatch(batch_size, *tensors):
    if len(tensors) == 1:
        tensor = tensors[0]
        for i in range(0, len(tensor), batch_size):
            yield tensor[i:i + batch_size]
    else:
        for i in range(0, len(tensors[0]), batch_size):
            yield tuple(x[i:i + batch_size] for x in tensors)
```

--
count: false

We will use it as follows:

```
for (minibatch_num,
     (batch_user, batch_item, batch_rating)) in enumerate(
         minibatch(batch_size, user_ids_tensor, item_ids_tensor, ratings_tensor)):
    ...
```

where the user and item ids are obtained from `numpy` integers and the ratings from `numpy` floats as follows:

```
user_ids_tensor = torch.from_numpy(users)
item_ids_tensor = torch.from_numpy(items)
ratings_tensor = torch.from_numpy(ratings)
```

---
## To know more about dataloading

have a look at the related [PyTorch tutorial](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html)

--
count: false

## Now: how to deal with symbolic data like ids?

---
# One-hot encoding

Encoding colors:

.center.width-40[![](images/one-hot.jpg)]

Why not simply use: blue = 0, red = 1 and so on?

What are the pros and cons of such a representation?

--
count: false

- Each axis has a meaning.
- Sparse, discrete representation with large dimension (the size of the vocabulary).
- Symbols are all equidistant from each other, with Euclidean distance $\sqrt{2}$, however large the vocabulary.

---
# Embeddings

Idea: project into $\mathbb{R}^d$ with $d$ much smaller than the size $n$ of the vocabulary!

--
count: false

More formally, let $W\in \mathbb{R}^{n\times d}$; we define
$$
\text{embedding}(x) = \text{onehot}(x) W
$$

$W$ is typically randomly initialized, then tuned by backprop: its entries are trainable parameters of the model.

--
count: false

- Continuous and dense representation.
- Can represent a huge vocabulary in a low dimension.
- Axes have no meaning a priori, but once trained, the embedding metric can capture semantic distance.

---
# Example: embeddings for movies in recommender systems

.center.width-50[![](images/Pca_viz.png)]

PCA of embeddings for movies learned in this [practical](https://github.com/mlelarge/dataflowr/blob/master/CEA_EDF_INRIA/03_collaborative_filtering_empty_colab.ipynb)

.citation.tiny[source Fast.ai]

---
## Embeddings in PyTorch

[Sparse layers](https://pytorch.org/docs/master/nn.html#embedding) in PyTorch: `torch.nn.Embedding(num_embeddings, embedding_dim)`

Example: creating embeddings for users

```
embedding_dim = 3
embedding_user = nn.Embedding(total_user_id, embedding_dim)
input = torch.LongTensor([[1, 2, 4, 5], [4, 3, 2, 0]])
embedding_user(input)
```

--
count: false

returns

```
tensor([[[-1.7219,  0.4682,  0.5729],
         [ 0.0657, -0.7046, -0.4397],
         [ 1.1512,  1.3380, -1.0272],
         [ 1.4405,  1.2945,  1.0051]],

        [[ 1.1512,  1.3380, -1.0272],
         [-0.6698, -2.1828, -0.1695],
         [ 0.0657, -0.7046, -0.4397],
         [-0.7966, -1.6769, -0.7454]]])
```

--
count: false

So now it is time to play with embeddings!

---
class: end-slide, center
count: false

The end.