Implicit Product Tagging

How To Label With Data You Don’t Have

Problem Statement

To best understand the problem that Implicit Labeling solves, it’s important to clarify what Explicit Labeling is. Fundamentally, any model needs data X to predict an output Y. For example, given a series of transactions (let’s call these X), we can say whether each one is fraudulent or legitimate (our label Y). Since we capture both sets of data, we can develop, train, and deploy a fraud model that looks at payments data and tells us what’s fraudulent and what isn’t.

What happens when we don’t have label Y? For example, say we want to measure alcohol sales on the Square platform. Sure, we have the names of items sold, but we don’t have a flag like isAlcohol='Y'. We could write rules that say, “If this item has ‘gin’, ‘mezcal’, or ‘beer’, label it as an alcoholic beverage,” but that kind of rule-writing is time-consuming and might require subject-matter expertise. We need a better way to associate item X with a potential label Y (that we don’t have).

Implicit Labeling

There is actually a family of transformer models that specialize in solving exactly this type of problem, through a technique called Zero-Shot Classification. In summary, you give one of these models a sequence of words, such as an item name X, and a set of potential labels Y.

Example

Item: “Szechuan Wontons”

Labels: ["Spicy," "Not Spicy"]

Scores: [0.76, 0.24]

In the example above, our item is Szechuan Wontons, and we want to label it as “Spicy” or “Not Spicy.” Running the item and labels through a Zero-Shot Classification model, it determines that the item is more likely to be spicy than not, probably because the token Szechuan is associated with “Spicy.” In practice, instead of using “Not Spicy,” we will use a blank space as our null label, because “ ” carries no semantic content that could bias the response. Our null label is probably as important as our real label, but we’ll get into that later.
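To make this concrete, here’s a minimal sketch of the example above using Hugging Face’s zero-shot pipeline (we’ll set this up properly in a moment; the exact scores will vary by model version):

from transformers import pipeline

# Illustrative sketch of the wonton example; scores are approximate.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier("Szechuan Wontons", ["Spicy", "Not Spicy"])
print(result["labels"], result["scores"])  # e.g. ['Spicy', 'Not Spicy'], [~0.76, ~0.24]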

For simplicity, the label we’re going to try to assign is coffee, because most people can tell what is coffee, what isn’t, and where things get nuanced. You might be able to capture coffee with a set of keywords, but again, we’re oversimplifying here so that it’s easier to follow.

Is It Coffee?

The goal of this exercise is to create a high-quality dataset that has accurately labeled values for items, and whether or not they are coffee. To start, you’re going to want to install transformers and torch, if you haven’t already.

pip install transformers
pip install torch

We’re also going to want to import these with a couple other packages to make this run smoothly.

import pandas as pd
import os 
import torch
import seaborn as sns

The transformer we’re going to use for this exercise is Facebook’s BART (facebook/bart-large-mnli). I’d encourage you to check this list of models on HuggingFace if you’re working with multiple languages (BaptisteDoyen/camembert-base-xnli is designed specifically for French, so I would default to it if you were doing this in French). Let’s load our model and set our classification label.

from transformers import pipeline

# Use the GPU if one is available; fall back to CPU otherwise.
if torch.cuda.is_available():
    classifier = pipeline("zero-shot-classification",
                          model="facebook/bart-large-mnli",
                          device=0)
else:
    classifier = pipeline("zero-shot-classification",
                          model="facebook/bart-large-mnli")

classification_label = 'coffee'
null_label = ' '
candidate_labels = [classification_label, null_label]

Notice we call torch.cuda.is_available to check whether an NVIDIA GPU is available. While you don’t need a GPU to run this pipeline, it is prohibitively slow on CPU: running one example takes less than a second on GPU but about 14 minutes on CPU. If you don’t have access to a GPU, try the default GPU runtime on Colab (even the free tier has GPU availability).

Let’s try some examples — first, with something obvious.

sequence_to_classify = "cappuccino"
classifier(sequence_to_classify, candidate_labels)

# Output
# {'sequence': 'cappuccino',
#  'labels': ['coffee', ' '],
#  'scores': [0.8640812039375305, 0.13591882586479187]}

Okay, looks like cappuccino is, in fact, coffee. Astounding performance. Let’s try something else.

sequence_to_classify = "wine"
classifier(sequence_to_classify, candidate_labels)

# Output
# {'sequence': 'wine',
#  'labels': [' ', 'coffee'],
#  'scores': [0.9873257279396057, 0.012674303725361824]}

Okay, wine is definitely not coffee. Still good. Note that the order of the labels has flipped: this model returns the labels in order of their scores. That’s mildly annoying for binary classification, but super useful when you have three or more possible labels.
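Because the ordering flips, a tiny helper is handy for pulling out the score of a specific label regardless of its position (a hypothetical convenience function, not part of the pipeline API; we’ll reuse it later):

def score_for(result, label):
    """Return the score for a given label, regardless of result ordering."""
    return dict(zip(result['labels'], result['scores']))[label]

score_for(classifier("wine", candidate_labels), 'coffee')  # ~0.0127

Okay, let’s try something else.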

sequence_to_classify = "Garlic Bread"
classifier(sequence_to_classify, candidate_labels)

# Output
# {'sequence': 'Garlic Bread',
#  'labels': [' ', 'coffee'],
#  'scores': [0.9944584369659424, 0.00554158678278327]}

Okay, let’s throw a curveball. Can we do the same thing in Japanese? Let’s give it a try. For reference, カプチーノ means cappuccino and コーヒー means coffee.

sequence_to_classify = "カプチーノ"
candidate_labels = ['コーヒー', ' ']
classifier(sequence_to_classify, candidate_labels)

# Output
# {'sequence': 'カプチーノ',
#  'labels': [' ', 'コーヒー'],
#  'scores': [0.9102840423583984, 0.08971594274044037]}

Ouch, so it looks like the null label won here. We might need to adjust our null label. Let’s try again, but setting the null label to ヌル (null in Japanese).

sequence_to_classify = "カプチーノ"
candidate_labels = ['コーヒー', 'ヌル']
classifier(sequence_to_classify, candidate_labels)

# Output
# {'sequence': 'カプチーノ',
#  'labels': ['コーヒー', 'ヌル'],
#  'scores': [0.5033012628555298, 0.4966987669467926]}

Okay, so this null_label isn’t perfect, but we’re going in the right direction. I don’t know Japanese, but the point to highlight here is to make sure that your null label reflects the use case at hand. You might need to experiment with a few before you’re ready to go.
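One simple way to experiment is to loop over a few null label candidates for an item you know should match, and see which candidate lets the true label win. A sketch, reusing the score_for helper from earlier (その他, “other,” is just one more hypothetical candidate to try):

# Compare null label candidates on a known positive example.
for null_candidate in [' ', 'ヌル', 'その他']:
    res = classifier('カプチーノ', ['コーヒー', null_candidate])
    print(repr(null_candidate), score_for(res, 'コーヒー'))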

Is This Coffee? What About This? How About This?

Okay, so we have a decently reliable way to label coffee for our items.

Now, we’re going to start labeling. However, I do not recommend running this on more than 10K records. Counterintuitively, more data is not better here.

Q: What? Why wouldn’t we want to run this on everything?

There are three reasons you’re not going to run this Zero-Shot on every record you have.

  1. At scale, Zero-Shot Classification can take a really long time. It takes about one minute to process 1,000 items, so if you have 100,000 records, you’re going to be running that model for over an hour and a half. You will also time out of the free version of Colab.
  2. As we’ll see in this article, Zero-Shot Classification is not perfect, and we can actually improve our data by going through some of the undecided labels.
  3. If you decide to supervise the results of the labeling and add corrections where needed, you will improve the quality of your dataset drastically. On a sample of 4,000 records, I spent about two minutes correcting labels that were undecided. If you run this on 100K records, recruit some friends (and some coffee) to go through the labeling corrections.

Q: But I actually need to label more than 10K records. What do I do?

If you need to label more than 10K records, I recommend you use the final dataset from this exercise and train a transformer to label the rest for you. We won’t cover that in this article, but we might in an upcoming article.

Before we let our Zero-Shot model rip on our entire dataset, let’s run it on a sample of about 4,000 records: unique item names from food and beverage sellers in the U.S., UK, Australia, and Ireland.

# Keep items from English-speaking markets, then sample 4,000 and dedupe names.
english_items = df[df.COUNTRY_CODE.isin(['US', 'AU', 'GB', 'IE'])]
sample = english_items.sample(4000).NAME.unique()

# Run the sample through the classifier
sample_labels = [classifier(x, candidate_labels) for x in sample]

# This took 3m 43s on a GPU
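As an aside, the pipeline will also accept a list of sequences directly, which lets it batch inputs on the GPU. A sketch (batch_size is a standard pipeline argument; the right value depends on your GPU memory):

# Equivalent to the list comprehension above, but batched.
sample_labels = classifier(list(sample), candidate_labels, batch_size=32)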

For each item, we’re going to want to pull out the probability assigned to our classification_label, using the score_for helper from earlier to handle the label ordering. In this case we’re going to create a column called coffee_score, which gives us the relative score that the item is coffee.

label_probs = [score_for(r, classification_label) for r in sample_labels]
sample_output = pd.DataFrame({'name': sample, classification_label + '_score': label_probs})

Let’s take a look at the score distribution to see how the sample was labeled.
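This is what we imported seaborn for; a quick histogram does the trick (the bin count here is an arbitrary choice):

sns.histplot(sample_output[classification_label + '_score'], bins=50)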

[Figure: histogram of coffee_score across the sample]

As we’d expect, the vast majority of items are not coffee, so it looks like our items have been labeled properly. Let’s take a look at different score tranches to see where the results start looking a lot less like coffee.

# Adjust or drop the 0.75 cutoff to inspect each tranche shown below.
sample_output[sample_output[classification_label+'_score'] < 0.75].sort_values(by=classification_label+'_score', ascending=False).head(10)

The top coffee_score results aren’t surprising. They have the classification label in their item name, so they’re obviously going to score highly.

Item Name         Coffee Score
1 large coffee    0.9601255655288696
2 coffee          0.9014308452606201
black coffee      0.8894062042236328

Going to the next tranche (coffee_score < 0.75), we’re seeing terms for coffee, such as mocha and latte. On to the next tranche.

Item Name            Coffee Score
bio-coffee           0.747865617275238
coffee and cookies   0.7452135682106018
milk latte           0.7413088083267212
mocha latte          0.7153729200363159

Going to scores under 0.50, we’re starting to see why this model is so valuable: a lot of items that traditional methods such as keyword matching would not have caught. Note that banoffee, a banana-and-toffee dessert rather than a coffee, also shows up here, presumably because of the “-offee” substring. Here is where the nuances kick in. Depending on the needs of your model, you’ll have to decide how to attribute edge cases like these.

Item Name    Coffee Score
banoffee     0.49095621705055237
biscoffee    0.490482360124588
espresso     0.4893379807472229
latte        0.4725055992603302

Based on the histogram and the scores below, it looks like we’re not getting strong matches under 0.30, so we can call these a wash and assign them a 0 for isCoffee.

Item Name      Coffee Score
sunday roast   0.2864392101764679
brewed hops    0.27759650349617004
diet coke      0.27755919098854065

After investigating the scores, I feel confident in setting an initial rule that says that if coffee_score < 0.30, isCoffee is 0; if coffee_score ≥ 0.60, isCoffee is 1. Everything in the middle will be undecided, and we’re going to export this dataset and fill in the items manually. For coffee, virtually anyone can go in and label what’s coffee and what isn’t, but for more complex items, you might want to limit your data size so you have to do less research.
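In pandas, that rule might look like the following sketch (the isCoffee column and the export filename are my own choices):

score_col = classification_label + '_score'

def assign_label(score):
    if score < 0.30:
        return 0       # confidently not coffee
    if score >= 0.60:
        return 1       # confidently coffee
    return None        # undecided: fill in manually

sample_output['isCoffee'] = sample_output[score_col].apply(assign_label)

# Export the undecided middle for manual review.
sample_output[sample_output['isCoffee'].isna()].to_csv('undecided_items.csv', index=False)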

Remember, 50 great data points are worth more than 10K average ones. The tighter your labeling is to the truth, the better your downstream models will be.
