Pinterest’s Engineering team recently published their research on fashion recommendation, which caught my attention because of its practicality. In that post, they introduce a new fashion recommendation task called Complete The Look. Exploring this task is a good exercise for any machine learning practitioner. Some of the terms are ambiguous and could mislead readers, so I will explain them in Part 1 and show how I reimplemented the fashion recommender system in the next part.
By the end of this tutorial, I hope you will understand:
- What is Complete The Look
- What is Style Embedding Architecture
- What is the compatibility distance
You can read the original paper here.
What is Complete The Look?
The first thing I want to point out is that Complete The Look is not a method; it is a new task for fashion recommendation.
According to the Pinterest team, Complete The Look is defined as follows:
Given a scene image and a product image, compute a quantitative measure of distance such that the distance measure reflects visual complementarity between the scene and the product. Such a distance measure can be used either by a binary classifier or by a re-ranker
In short, Complete The Look is the task of finding products that are compatible with a scene, rather than products that are compatible with each other. This is the intuition behind the name “Complete The Look”.
Traditional approaches have a big disadvantage: they only compare the product images in the store. These products are usually shot against a blank background, so the images do not show how the products look in real life. Moreover, it would be a huge waste not to take advantage of the images that users post on social networks. Complete The Look is the task Pinterest proposed to address this problem.
To measure the compatibility between products and scenes, Pinterest proposed a new neural network called Style Embedding. After training, Style Embedding gives us a unified representation of scenes and products.
Style Embedding uses the features from intermediate layers of ResNet50. ResNet50 is chosen because its features are said to be transferable and strong at capturing local detail.
The ResNet50 features are passed through a feed-forward network that maps the scene and the products into the same space. There are three things we need embeddings for: the entire scene image, the local regions of the scene, and the product image.
Three neural networks produce these three types of embedding. They have an architecture similar to the one in the image above, but with different hyperparameters.
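To make the idea of "mapping features into the same space" concrete, here is a minimal sketch in plain Python. It is not Pinterest's architecture: the two-layer shape, the ReLU activation, the L2 normalization at the end, and every number in the toy weights are my own assumptions for illustration. The input `features` stands in for a feature vector that would come from ResNet50.

```python
import math

def linear(x, weights, bias):
    """Dense layer: `weights` holds one row of coefficients per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def l2_normalize(x):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x] if norm > 0 else x

def embed(features, w1, b1, w2, b2):
    """Hypothetical two-layer feed-forward head on top of backbone features."""
    hidden = relu(linear(features, w1, b1))
    return l2_normalize(linear(hidden, w2, b2))

# Toy stand-in for a ResNet50 feature vector (real ones are much larger).
features = [1.0, 2.0, 3.0, 4.0]
w1 = [[0.1, 0.0, -0.1, 0.2], [0.0, 0.3, 0.1, -0.1], [-0.2, 0.1, 0.0, 0.1]]
b1 = [0.0, 0.1, 0.0]
w2 = [[0.5, -0.5, 0.1], [0.2, 0.4, -0.3]]
b2 = [0.0, 0.0]

embedding = embed(features, w1, b1, w2, b2)
print(embedding)
```

In a setup like this, the scene, the scene regions, and the product would each get their own set of weights, while all three outputs have the same length so they can be compared directly.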
After obtaining the visual embeddings of the images, we compute the compatibility distance.
There are three types of distance to consider:
- Global distance
- Local distance
- Hybrid distance
First, we need to see how compatible the product is with the scene globally. This is similar to putting the two images side by side and glancing at them. The global distance is simply the distance between the embedding of the scene image and the embedding of the product image.
Here s and p denote the scene and product images, and fs and fp are the embeddings of the scene and the product.
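As a small illustration of the global distance, here is a sketch in plain Python. I am assuming that the "difference" between the two embeddings is measured with the Euclidean (L2) distance; the vectors `f_s` and `f_p` are made-up toy embeddings, not outputs of a real model.

```python
import math

def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def global_distance(f_s, f_p):
    """Distance between the whole-scene embedding and the product embedding."""
    return l2_distance(f_s, f_p)

# Toy embeddings (assumed values, for illustration only).
f_s = [0.6, 0.8, 0.0]
f_p = [0.0, 0.8, 0.6]

print(global_distance(f_s, f_p))
```

A smaller value means the product is considered more compatible with the scene as a whole.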
After looking at the two images globally, we should look at the details.
The local distance is an attention-based measure of compatibility.
We use a cropping layer to crop the scene into several local regions.
The attention weights help the system focus on where the item is likely to appear. These weights are computed from the difference between the category image and each region of the scene image. The softmax function then normalizes them so that each weight lies in the range (0, 1) and all weights sum to 1.
One thing to note is that ê is the embedding of the correct product image.
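The attention step can be sketched as follows. The assumptions here are mine: I measure the "difference" with the Euclidean distance and negate it before the softmax, so that regions closer to ê receive larger weights. The region embeddings and `e_hat` are toy values.

```python
import math

def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def attention_weights(region_embeddings, e_hat):
    """Softmax over negative distances: closer regions get larger weights."""
    scores = [-l2_distance(f_i, e_hat) for f_i in region_embeddings]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Three toy region embeddings and a toy reference embedding e_hat.
region_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
e_hat = [1.0, 0.0]

weights = attention_weights(region_embeddings, e_hat)
print(weights)
```

The first region coincides with `e_hat`, so it receives the largest weight, and the weights always sum to 1.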
The local distance is the attention-weighted sum of the distances between the product embedding and the embedding of each region of the scene.
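Here is a minimal sketch of that weighted sum. As before, using the Euclidean distance is my assumption, and the attention weights are passed in as plain numbers (they would normally come from the softmax step); all values are toy data.

```python
import math

def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def local_distance(region_embeddings, f_p, weights):
    """Attention-weighted sum of region-to-product distances."""
    return sum(a * l2_distance(f_i, f_p)
               for a, f_i in zip(weights, region_embeddings))

# Two toy regions, a toy product embedding, and assumed attention weights.
region_embeddings = [[1.0, 0.0], [0.0, 1.0]]
f_p = [1.0, 0.0]
weights = [0.75, 0.25]

d_local = local_distance(region_embeddings, f_p, weights)
print(d_local)
```

Because most of the attention falls on the region that matches the product, the mismatching region contributes only a quarter of its distance.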
The compatibility should depend on how well the product fits both the local regions and the entire scene. Therefore, the hybrid distance is the mean of the global distance and the local distance.
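The hybrid distance, and how it could drive a re-ranker, can be sketched like this. The candidate names and their (global, local) distance pairs are invented for illustration; in practice they would be computed from the embeddings.

```python
def hybrid_distance(d_global, d_local):
    """Mean of the global and local compatibility distances."""
    return 0.5 * (d_global + d_local)

def rank_products(candidates):
    """Sort product names from most to least compatible (smallest distance first)."""
    return sorted(candidates, key=lambda name: hybrid_distance(*candidates[name]))

# Hypothetical candidates: name -> (global distance, local distance).
candidates = {
    "bag": (0.4, 0.6),
    "shoes": (0.9, 0.3),
    "hat": (0.2, 0.2),
}

ranking = rank_products(candidates)
print(ranking)
```

Note that "shoes" has the best local distance after "hat" but still ranks last, because its poor global distance pulls the mean up.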
Pinterest uses a triplet loss to train the Style Embedding neural networks. This loss is described as follows:
To train the Style Embedding model, we need triplets of a scene image, a positive image, and a negative image. The positive image is a product that is compatible with the scene, whereas the negative image is an incompatible product. Training tries to minimize the distance between the scene and the positive product and to push the negative product far from the scene. The margin α sets the gap we want between the positive distance and the negative distance. The loss is clipped at zero, so it is never negative.
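For a single triplet, the loss described above can be sketched as a hinge on the two distances. This follows the standard triplet-loss form, which I am assuming matches the description; the margin value 0.2 is an arbitrary choice for the example, and the distances are toy numbers.

```python
def triplet_loss(d_pos, d_neg, margin=0.2):
    """Hinge-style triplet loss for one (scene, positive, negative) triplet.

    d_pos: distance between the scene and the compatible product
    d_neg: distance between the scene and the incompatible product
    margin: the gap (alpha) we want between the two distances
    """
    return max(0.0, d_pos - d_neg + margin)

# The negative is already far enough away: the loss is zero.
print(triplet_loss(d_pos=0.3, d_neg=0.9))

# The negative is closer than the positive: the loss is positive.
print(triplet_loss(d_pos=0.8, d_neg=0.5))
```

Once the negative product is at least α farther from the scene than the positive one, the triplet stops contributing to the loss.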
That is all for the first part. I hope you now have a basic understanding of how Pinterest approaches the Complete The Look problem. Fashion recommendation is an active topic right now, and it is not a pie for only one person. The playground is still waiting for more efficient methods, and Complete The Look is Pinterest's latest attempt. In the next part, I will show you how to implement Pinterest's fashion recommender.